The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.
Kaggle is a data science competition platform that hosts many datasets. In the past, submitting results was cumbersome: you had to go to the competition page in your browser and drag your files there. Now you can interact with Kaggle via the command line, e.g.,
! kaggle competitions files home-credit-default-risk
It is quite easy to set up; a full submission takes less than 15 minutes.
Place your kaggle.json API token file in the right place (~/.kaggle/kaggle.json). For more detailed information on setting up the Kaggle API see here and here.
!pip install kaggle
!pwd
!pwd
!ls -l ./kaggle.json
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
! kaggle competitions files home-credit-default-risk
Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.
Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.
The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either not obtain loans or become victims of untrustworthy lenders.
Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).
The HomeCredit_columns_description.csv file acts as a data dictionary.
There are 7 different sources of data:
name                      [ rows        cols ]  MegaBytes
-----------------------   -------------------  ---------
application_train       : [    307,511,  122]   158MB
application_test        : [     48,744,  121]    25MB
bureau                  : [  1,716,428,   17]   162MB
bureau_balance          : [ 27,299,925,    3]   358MB
credit_card_balance     : [  3,840,312,   23]   405MB
installments_payments   : [ 13,605,401,    8]   690MB
previous_application    : [  1,670,214,   37]   386MB
POS_CASH_balance        : [ 10,001,358,    8]   375MB
Create a base directory:
DATA_DIR = "Data/home-credit-default-risk" #same level as course repo in the data directory
Please download the project data files and data dictionary and unzip them using either of the following approaches:
Use the Download button on the competition's Data webpage and unzip the zip file to the DATA_DIR ("Data/home-credit-default-risk", at the same level as the course repo).
#DATA_DIR = os.path.join('./ddddd/')
!mkdir -p $DATA_DIR
!ls -l $DATA_DIR
! kaggle competitions download home-credit-default-risk -p $DATA_DIR
!pwd
!ls -l $DATA_DIR
#!rm -r $DATA_DIR
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
unzippingReq = True  # set to False once the archive has been extracted
if unzippingReq:
    # extractall() extracts all members from the archive; path specifies the directory to extract to
    with zipfile.ZipFile(f'{DATA_DIR}/home-credit-default-risk.zip', 'r') as zip_ref:
        zip_ref.extractall(f'{DATA_DIR}/')
DATA_DIR = "Data/home-credit-default-risk"
!ls -l Data/home-credit-default-risk/application_train.csv
def load_data(in_path, name):
df = pd.read_csv(in_path)
print(f"{name}: shape is {df.shape}")
print(df.info())
display(df.head(5))
return df
datasets={} # lets store the datasets in a dictionary so we can keep track of them easily
ds_name = 'credit_card_balance'
# DATA_DIR=f"{DATA_DIR}/home-credit-default-risk/"
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
datasets['credit_card_balance'].shape
DATA_DIR
ds_name = 'application_test'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
The application dataset has the most information about the client: Gender, income, family status, education ...
%%time
ds_names = ("application_train", "application_test", "bureau","bureau_balance","credit_card_balance","installments_payments",
"previous_application","POS_CASH_balance")
for ds_name in ds_names:
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
for ds_name in datasets.keys():
print(f'dataset {ds_name:24}: [ {datasets[ds_name].shape[0]:10,}, {datasets[ds_name].shape[1]}]')
#Visualizing the number of rows and columns in each dataset
pd.options.display.float_format = '{:.4f}'.format
ds_info = pd.DataFrame(columns=['row_count','column_count'], index=ds_names)
for ds_name in ds_names:
    ds_info.loc[ds_name, 'row_count'] = datasets[ds_name].shape[0]
    ds_info.loc[ds_name, 'column_count'] = datasets[ds_name].shape[1]
print(ds_info)
fig = plt.figure(figsize=(30,30))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
ylim = [30000000,150]  # bureau_balance has ~27.3M rows, so leave headroom on the row-count axis
axes = [ax1,ax2]
for i in range(len(ds_info.columns)):
ds_info.iloc[:,i].plot(kind = 'bar',ax=axes[i],title = ds_info.columns[i],xlabel = 'Dataset name',ylabel='Count',ylim=(0,ylim[i]))
From the graph, it is clear that the dataset with the highest number of rows in the HCDR data is 'bureau_balance', with over 27 million rows, but it has the fewest features, with only 3. The 'installments_payments' dataset has the second-highest number of rows, with over 13 million. On the other hand, the 'application_train' dataset has the highest number of features, with 122, followed by the 'previous_application' dataset with 37 features.
from IPython.display import display, HTML
pd.set_option("display.max_rows", None, "display.max_columns", None)
def dataset_summary(df, name):
    print(f"Summary of the dataset '{name}':\n")
    print("Basic Info:")
    print(df.info(verbose=True, show_counts=True))  # null_counts was renamed show_counts in newer pandas
    print(f"\nDescription of the dataset {name}:\n")
    display(HTML(np.round(df.describe(), 2).to_html()))
    print("\nData Types feature counts:\n", df.dtypes.value_counts())
    print("\nDataframe Shape: ", df.shape)
    print("\nUnique elements count in each object column:")
    print(df.select_dtypes('object').apply(pd.Series.nunique, axis=0))
    df_dtypes = df.columns.to_series().groupby(df.dtypes).groups
    print(f"\n\nList of Categorical and Numerical (int + float) features of {name}:\n")
    for k, v in df_dtypes.items():
        print({k.name: v}, '\n')
dataset_summary(datasets['application_train'], 'application_train')
From the dataset description, we observe that several features in the application_train dataset have negative values: DAYS_BIRTH, DAYS_EMPLOYED, DAYS_REGISTRATION, DAYS_ID_PUBLISH, and DAYS_LAST_PHONE_CHANGE. These columns count days relative to the application date, so the negative sign is a convention rather than a data error.
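Because the DAYS_* columns count backwards from the application date, dividing by -365 turns them into readable, positive ages. A minimal sketch with made-up values (the column name matches the HCDR schema, the numbers do not):

```python
import pandas as pd

# Toy frame mimicking the DAYS_* convention: days are counted
# backwards from the application date, hence the negative values.
df = pd.DataFrame({"DAYS_BIRTH": [-12005, -16765, -19046]})

# Convert to a positive age in years for readability.
df["AGE_YEARS"] = df["DAYS_BIRTH"] / -365

print(df["AGE_YEARS"].round(1).tolist())  # [32.9, 45.9, 52.2]
```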
dataset_summary(datasets['application_test'], 'application_test')
Similar to the training set, the test dataset also has negative values for the same DAYS_* features.
dataset_summary(datasets['bureau'], 'bureau')
dataset_summary(datasets['bureau_balance'], 'bureau_balance')
dataset_summary(datasets['credit_card_balance'], 'credit_card_balance')
dataset_summary(datasets['installments_payments'], 'installments_payments')
dataset_summary(datasets['previous_application'], 'previous_application')
dataset_summary(datasets['POS_CASH_balance'], 'POS_CASH_balance')
def missing_data_plot(df, name):
percent = (df.isnull().sum()/df.isnull().count()*100).sort_values(ascending = False).round(2)
missing_count = df.isna().sum().sort_values(ascending = False)
missing_data = pd.concat([percent, missing_count], axis=1, keys=['Percent', "Train Missing Count"])
missing_data=missing_data[missing_data['Percent'] > 0]
print(f"\n The summary of the missing data in the dataset '{name}': \n")
if len(missing_data)==0:
print("No missing Data")
else:
display(HTML(missing_data.to_html())) # display all the rows
return missing_data
app_train_missing_data = missing_data_plot(datasets['application_train'], 'application_train')
app_test_missing_data = missing_data_plot(datasets['application_test'], 'application_test')
bureau_missing_data = missing_data_plot(datasets['bureau'], 'bureau')
bureau_balance_missing_data = missing_data_plot(datasets['bureau_balance'], 'bureau_balance')
credit_card_balance_missing_data = missing_data_plot(datasets['credit_card_balance'], 'credit_card_balance')
installments_payments_missing_data = missing_data_plot(datasets['installments_payments'], 'installments_payments')
prev_app_missing_data = missing_data_plot(datasets['previous_application'], 'previous_application')
pos_cash_bal_missing_data = missing_data_plot(datasets['POS_CASH_balance'], 'POS_CASH_balance')
#Plot to visualize the number of missing features in each table
fig = plt.figure(figsize=(30,30))
plot1 = fig.add_subplot(4,2,1)
plot2 = fig.add_subplot(4,2,2)
plot3 = fig.add_subplot(4,2,3)
plot4 = fig.add_subplot(4,2,4)
plot5 = fig.add_subplot(4,2,5)
plot6 = fig.add_subplot(4,2,6)
plot7 = fig.add_subplot(4,2,7)
axes = [plot1, plot2, plot3, plot4, plot5, plot6, plot7]
all_df = [app_train_missing_data, app_test_missing_data, bureau_missing_data,
credit_card_balance_missing_data, installments_payments_missing_data,
prev_app_missing_data, pos_cash_bal_missing_data]
for i in range(7):
df = all_df[i]
df.loc[(df.Percent > 0.0),'Percent'].plot(kind='bar',ax=axes[i],title=df.columns[1],ylim = (0,100),fontsize=10)
plt.subplots_adjust(hspace=0.6)
plt.show()
Missing data in HCDR is present in several features across the different tables, with some tables having a much higher share of missing values than others. In the application_train dataset the missing percentage per column ranges from 0.0% to 69.9%.
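A common follow-up to this summary is to flag columns whose missing share exceeds some threshold, either to drop them or to add a missing-indicator feature. A sketch with a toy frame; the 60% threshold is an arbitrary example, not a recommendation:

```python
import numpy as np
import pandas as pd

# Toy frame: one column mostly missing, one complete (values are made up).
df = pd.DataFrame({"COMMONAREA_AVG": [np.nan, np.nan, np.nan, 0.1],
                   "AMT_CREDIT": [1.0, 2.0, 3.0, 4.0]})

# Share of missing values per column, as in the summary tables above.
pct_missing = df.isna().mean() * 100

# Columns above the chosen threshold are candidates for dropping
# or for a missing-indicator feature.
to_drop = pct_missing[pct_missing > 60].index.tolist()
print(to_drop)  # ['COMMONAREA_AVG']
```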
import matplotlib.pyplot as plt
%matplotlib inline
datasets["application_train"]['TARGET'].astype(int).plot.hist();
correlations = datasets["application_train"].corr(numeric_only=True)['TARGET'].sort_values()
print('Most Positive Correlations:\n', correlations.tail(10))
print('\nMost Negative Correlations:\n', correlations.head(10))
correlations
pos_corr = correlations.tail(10).index.values
neg_corr = correlations.head(10).index.values
# Distribution of top 10 positive correlation variables with respect to TARGET
numVar = pos_corr.shape[0]
plt.figure(figsize=(20,50))
for i,var in enumerate(pos_corr):
plt.subplot(numVar,3,i+1)
sns.histplot(datasets['application_train'], x=var, hue="TARGET", kde=True, bins=50, palette=['navy', 'red'])
plt.subplots_adjust(hspace=0.50)
plt.title(var, fontsize = 10)
plt.tight_layout()
plt.show()
The plot above shows histograms of the top 10 variables most positively correlated with the TARGET variable. The variables with the highest positive correlation tend to relate to credit bureau records and loan history, such as the number of days overdue on past credits, the number of previous loans, and the number of enquiries to the credit bureau.
# Distribution of top 10 negative correlation variables with target of HCDR
numVar = neg_corr.shape[0]
plt.figure(figsize=(20, 50))
for i, col in enumerate(neg_corr):
defaulter = datasets["application_train"].loc[datasets["application_train"]['TARGET'] == 1, col]
non_defaulter = datasets["application_train"].loc[datasets["application_train"]['TARGET'] == 0, col]
mu = np.mean(datasets['application_train'][col])
median = np.median(datasets['application_train'][col])
sigma = np.std(datasets['application_train'][col])
plt.subplot(numVar, 3, i+1)
plot = sns.histplot(data=datasets['application_train'][col], kde=True, bins=50, color='navy')
plt.axvline(mu, color='red', linestyle='dashed', linewidth=1)
plt.axvline(median, color='green', linestyle='dashed', linewidth=1)
plt.subplots_adjust(hspace=0.50)
plt.title(col, fontsize=10)
plt.tight_layout()
plt.show()
The plot above shows histograms of the top 10 variables most negatively correlated with the TARGET variable. We used kernel density estimation (KDE) to visualize the probability distribution of each feature. The distributions of 'EXT_SOURCE_2' and 'EXT_SOURCE_3' are right skewed, whereas 'EXT_SOURCE_1' approximately follows a normal distribution.
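The visual impression of skew can be quantified with pandas' `skew()`: a long right tail gives positive skewness, a long left tail negative. A minimal sketch on toy series (not HCDR data):

```python
import pandas as pd

# Toy distributions: one with a long right tail, one with a long left tail.
right_tailed = pd.Series([0.1, 0.2, 0.2, 0.3, 0.9])
left_tailed = pd.Series([0.1, 0.7, 0.8, 0.8, 0.9])

print(right_tailed.skew() > 0)  # long right tail -> positive skewness
print(left_tailed.skew() < 0)   # long left tail  -> negative skewness
```

Applying the same call to the EXT_SOURCE_* columns would put a number on the asymmetry seen in the histograms.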
numerical_features_app = datasets['application_train'].select_dtypes(include = ['int64','float64']).columns
categorical_features_app = datasets['application_train'].select_dtypes(include = ['object']).columns
dtype_count = pd.DataFrame(index=ds_names, columns=['Categorical','Numerical'])
for ds_name in ds_names:
    categorical_features = datasets[ds_name].select_dtypes(include=['object']).columns
    numerical_features = datasets[ds_name].select_dtypes(include=['int64','float64']).columns
    dtype_count.loc[ds_name, 'Categorical'] = len(categorical_features)
    dtype_count.loc[ds_name, 'Numerical'] = len(numerical_features)
print(dtype_count)
dtype_count['Categorical'].plot(kind='bar',ylim = (0,20),fontsize=12,title='Number of Categorical Features in each table',xlabel = 'Table name')
We see that the application and previous_application tables have the most categorical features (16 each). Hence we will have to use one-hot encoding in order to pass the data to the learning algorithm.
dtype_count['Numerical'].plot(kind='bar',ylim = (0,120),fontsize=12,title='Number of Numerical Features in each table',xlabel = 'Table name')
The application table has the most numerical features, which makes it the easiest to work with directly.
#Plot distribution of each numerical input variable
fig, ax = plt.subplots(21, 5, figsize=(50, 50))
numerical_fea = datasets['application_train'][numerical_features_app]
numerical_fea.loc[:, numerical_fea.columns != 'SK_ID_CURR'].hist(
bins=10, figsize=(50,50),xrot = 45,legend=True,ax=ax)
plt.title('Histogram Plot for numerical features')
plt.show()
The histogram plot of each numerical feature provides a clear understanding of the distribution of the features, such as the range and spread of the values. We can observe the distribution of the features and estimate the location and spread of the data. Additionally, we can identify the outliers and understand whether the distribution is skewed or symmetric. These insights help in understanding the relationship between the features and the target variable and can be useful for feature engineering and model building.
#Distribution of categorical variables
df_categorical = datasets['application_train'][categorical_features_app].copy()
df_categorical['TARGET'] = datasets['application_train']['TARGET']
df_categorical['TARGET'] = df_categorical['TARGET'].replace({0: 'Non-Defaulter', 1: 'Defaulter'})
num_cols = 2
num_rows = int(len(categorical_features_app) / num_cols)
fig, ax = plt.subplots(num_rows, num_cols, figsize=(20, 50))
col = 0
for i in range(num_rows):
for j in range(num_cols):
if col < len(categorical_features_app):
plot = sns.countplot(x=categorical_features_app[col],
data=df_categorical, hue='TARGET', ax=ax[i][j], palette='GnBu')
plot.set_title(f"Distribution of the {categorical_features_app[col]} variable.")
plot.set_xticklabels(plot.get_xticklabels(), rotation=90)
plt.subplots_adjust(hspace=0.45)
col += 1
plt.tight_layout()
plt.show()
The code plots a countplot for each categorical feature in the HCDR application dataset, showing the distribution of the target variable (Defaulter or Non-Defaulter) for each category. This allows us to see if certain categories have a higher proportion of defaulters, which can be useful in identifying potential risk factors for loan default. The code also sets the x-axis tick labels to be rotated 90 degrees to make them easier to read. Overall, the countplots provide a useful visualization of the distribution of categorical variables in the dataset.
#Pairplot
run = True
if run:
df_name = 'application_train'
num_attribs = ['TARGET', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'DAYS_EMPLOYED',
'DAYS_BIRTH', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
'AMT_GOODS_PRICE','REGION_RATING_CLIENT','OWN_CAR_AGE']
df = datasets[df_name].copy()
num_df = df[num_attribs].copy()
# Pair-plot
num_df['TARGET'] = num_df['TARGET'].replace({0: "No Default", 1: "Default"})
sns.pairplot(num_df, hue="TARGET", markers=["s", "o"])
# numerical_features_app = datasets['application_train'].select_dtypes(include = ['int64','float64']).columns
# df_numerical = datasets['application_train'][numerical_features_app]
# sns.pairplot(df_numerical)
# plt.show()
The pairplot lets us inspect pairwise relationships among the selected features, coloured by default status.
x = datasets['application_train']
# plt.hist(datasets["application_train"]['DAYS_BIRTH'] / -365, edgecolor = 'k', bins = 25)
plt.hist([x[x['TARGET'] == 1]['DAYS_BIRTH'] / -365,x[x['TARGET'] == 0]['DAYS_BIRTH'] / -365], edgecolor = 'k', bins = 25,color=['b','g'],label=['default','non-default'])
plt.title('Age of Client'); plt.xlabel('Age (years)'); plt.ylabel('Count');
Based on this plot, it appears that the majority of defaulters are between the ages of 25 to 40, while non-defaulters are more evenly distributed across age ranges.
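The age-band observation can be turned into a feature: bin DAYS_BIRTH into bands with `pd.cut` and tabulate the default rate per band. A sketch on a fabricated five-row sample (the DAYS_BIRTH and TARGET values are made up):

```python
import pandas as pd

# Toy sample: ages derived from DAYS_BIRTH, plus a fabricated TARGET column.
df = pd.DataFrame({"DAYS_BIRTH": [-9125, -12775, -16425, -20075, -23725],
                   "TARGET": [1, 1, 0, 0, 0]})
df["AGE"] = df["DAYS_BIRTH"] / -365  # ages 25, 35, 45, 55, 65

# Bin into age bands and compute the default rate per band.
df["AGE_BAND"] = pd.cut(df["AGE"], bins=[20, 40, 60, 80])
rate = df.groupby("AGE_BAND", observed=True)["TARGET"].mean()
print(rate.tolist())  # [1.0, 0.0, 0.0]
```

On the real data this kind of table makes the "younger applicants default more often" claim checkable rather than visual.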
sns.countplot(x='OCCUPATION_TYPE', data=datasets["application_train"]);
plt.title('Applicants Occupation');
plt.xticks(rotation=90);
The distribution of applicants' occupation shows that the majority of the applicants are laborers, followed by sales staff and core staff.
#plt.hist(datasets["application_train"]['CODE_GENDER'] , edgecolor = 'k',stacked=True)
plt.hist([x[x['TARGET'] == 1]['CODE_GENDER'] ,x[x['TARGET'] == 0]['CODE_GENDER']], edgecolor = 'k', bins = 2,color=['g','r'],label=['default','non-default'])
plt.title('Applicants Gender'); plt.xlabel('Gender(M/F)'); plt.ylabel('Count');plt.xlim(0,10);
Here we notice that the number of female applicants who are defaulters is slightly higher than the number of male applicants.
plt.hist(datasets["application_train"]['NAME_CONTRACT_TYPE'] , edgecolor = 'k')
plt.title('Types of Loans'); plt.xlabel('Loan Types'); plt.ylabel('Count');
plt.hist(datasets["application_train"]['FLAG_OWN_CAR'] , edgecolor = 'k')
plt.title('Car Ownership'); plt.xlabel('Applicant Owns Car (Y|N)'); plt.ylabel('Count');
list(datasets.keys())
len(datasets["application_train"]["SK_ID_CURR"].unique()) == datasets["application_train"].shape[0]
# is there an overlap between the test and train customers
np.intersect1d(datasets["application_train"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"])
#
datasets["application_test"].shape
datasets["application_train"].shape
Most of the applicants in the Kaggle submission file have had previous applications recorded in previous_application.csv: 47,800 out of 48,744 people.
appsDF = datasets["previous_application"]
display(appsDF.head())
print(f"{appsDF.shape[0]:,} rows, {appsDF.shape[1]:,} columns")
print(f"There are {appsDF.shape[0]:,} previous applications")
#Find the intersection of two arrays.
print(f'Number of train applicants with previous applications is {len(np.intersect1d(datasets["previous_application"]["SK_ID_CURR"], datasets["application_train"]["SK_ID_CURR"])):,}')
#Find the intersection of two arrays.
print(f'Number of train applicants with previous applications is {len(np.intersect1d(datasets["previous_application"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"])):,}')
# How many previous applications per applicant are in previous_application?
prevAppCounts = appsDF['SK_ID_CURR'].value_counts(dropna=False)
len(prevAppCounts[prevAppCounts >40]) #more than 40 previous applications
plt.hist(prevAppCounts[prevAppCounts>=0], bins=100)
plt.grid()
prevAppCounts[prevAppCounts >50].plot(kind='bar')
plt.xticks(rotation=90)
plt.show()
sum(appsDF['SK_ID_CURR'].value_counts()==1)
plt.hist(appsDF['SK_ID_CURR'].value_counts(), cumulative =True, bins = 100);
plt.grid()
plt.ylabel('cumulative number of IDs')
plt.xlabel('Number of previous applications per ID')
plt.title('Histogram of Number of previous applications for an ID')
* Low = fewer than 5 previous applications (22%)
* Medium = 5 to 39 previous applications (58%)
* High = 40 or more previous applications (20%)
apps_all = appsDF['SK_ID_CURR'].nunique()
apps_5plus = appsDF['SK_ID_CURR'].value_counts()>=5
apps_40plus = appsDF['SK_ID_CURR'].value_counts()>=40
print('Percentage with 5 or more previous apps:', np.round(100.*(sum(apps_5plus)/apps_all),5))
print('Percentage with 40 or more previous apps:', np.round(100.*(sum(apps_40plus)/apps_all),5))
In the case of the HCDR competition (and many other machine learning problems that involve multiple tables in 3NF or not) we need to join these datasets (denormalize) when using a machine learning pipeline. Joining the secondary tables with the primary table will lead to lots of new features about each loan application; these features will tend to be aggregate type features or meta data about the loan or its application. How can we do this when using Machine Learning Pipelines?
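The usual pattern, developed in detail below, is: aggregate each secondary table down to one row per SK_ID_CURR, then left-join the result onto the application table so each application keeps exactly one row. A minimal sketch with toy frames (both frames and the `PREV_` column name are made up for illustration):

```python
import pandas as pd

# Toy primary table (one row per application) and secondary table
# (many rows per applicant), mimicking the HCDR layout.
apps = pd.DataFrame({"SK_ID_CURR": [1, 2], "AMT_CREDIT": [100.0, 200.0]})
prev = pd.DataFrame({"SK_ID_CURR": [1, 1, 2],
                     "AMT_APPLICATION": [10.0, 30.0, 50.0]})

# Aggregate the secondary table down to one row per applicant...
agg = prev.groupby("SK_ID_CURR", as_index=False)["AMT_APPLICATION"].mean()
agg = agg.rename(columns={"AMT_APPLICATION": "PREV_AMT_APPLICATION_mean"})

# ...then left-join so every application keeps exactly one row.
merged = apps.merge(agg, how="left", on="SK_ID_CURR")
print(merged["PREV_AMT_APPLICATION_mean"].tolist())  # [20.0, 50.0]
```

The left join matters: applicants with no rows in the secondary table survive the merge with NaN aggregates instead of being dropped.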
Joining previous_application with application_x

We refer to the application_train data (and also the application_test data) as the primary table and the other files as secondary tables (e.g., the previous_application dataset). The secondary tables can be joined to the application tables on the key SK_ID_CURR (bureau_balance joins to bureau via SK_ID_BUREAU, while SK_ID_PREV identifies individual previous applications).
Let's assume we wish to generate a feature based on previous application attempts. In this case, possible features here could be:
Aggregates of AMT_APPLICATION and AMT_CREDIT, based on the average, min, max, median, etc. To build such features, we need to join the application_train data (and also the application_test data) with the previous_application dataset (and the other available datasets).
When joining this data in the context of pipelines, different strategies come to mind with various tradeoffs:
* Strategy 1: merge the secondary tables with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) prior to partitioning the data (into train, valid, and test splits) via your machine learning pipeline. [This approach is recommended for this HCDR competition. WHY? I want you to think about this section and build on this.]
* Strategy 2: merge the secondary tables after partitioning the application_train data (the labeled dataset) and the application_test data (the unlabeled submission dataset), thereby leading to X_train, y_train, X_valid, etc.

import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[np.nan, np.nan, np.nan]],
columns=['A', 'B', 'C'])
display(df)
df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
# A B
#max NaN 8.0
#min 1.0 2.0
#sum 12.0 NaN
df = pd.DataFrame({'A': [1, 1, 2, 2],
'B': [1, 2, 3, 4],
'C': np.random.randn(4)})
display(df)
# group by column A:
df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})
# B C
# min max sum
#A
#1 1 2 0.590716
#2 3 4 0.704907
appsDF.columns
funcs = ["a","b","c"]
{f:f"{f}_max" for f in funcs}
So far, both our boolean selections have involved a single condition. You can, of course, have as many conditions as you would like. To do so, you will need to combine your boolean expressions using the three logical operators and, or and not.
Although Python uses the keywords and, or, and not, these will not work when testing multiple conditions with pandas. The details of why are explained here.
You must use the following operators with pandas: & (and), | (or), ~ (not).
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704)]
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704)]["AMT_CREDIT"]
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704) & ~(appsDF["AMT_CREDIT"]==1.0)]
appsDF.isna().sum()
appsDF.columns
appsDF[['AMT_ANNUITY', 'AMT_APPLICATION']].head()  # preview the columns we will aggregate
The groupby output will have an index or multi-index on rows corresponding to your chosen grouping variables. To avoid setting this index, pass “as_index=False” to the groupby operation.
import pandas as pd
import dateutil
# Load data from csv file
data = pd.read_csv('phone_data.csv')  # pd.DataFrame.from_csv was removed from pandas; use pd.read_csv
# Convert date from string to date times
data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True)
data.groupby('month', as_index=False).agg({"duration": "sum"})
Pandas reset_index() to convert Multi-Index to Columns
We can simplify the multi-index dataframe using reset_index() function in Pandas. By default, Pandas reset_index() converts the indices to columns.
Since we have both the variable name and the operation performed in two rows in the Multi-Index dataframe, we can use that and name our new columns correctly.
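The flattening can be sketched in a few lines: a multi-op `agg` produces MultiIndex columns, `map("_".join)` fuses each (variable, operation) pair into one name, and `reset_index()` turns the group key back into an ordinary column (toy data):

```python
import pandas as pd

# A multi-op aggregation produces MultiIndex columns.
df = pd.DataFrame({"SK_ID_CURR": [1, 1, 2], "AMT_ANNUITY": [10.0, 20.0, 40.0]})
result = df.groupby("SK_ID_CURR").agg({"AMT_ANNUITY": ["min", "max"]})

# Flatten ('AMT_ANNUITY', 'min') -> 'AMT_ANNUITY_min', then turn the
# group index back into an ordinary column.
result.columns = result.columns.map("_".join)
result = result.reset_index()
print(result.columns.tolist())  # ['SK_ID_CURR', 'AMT_ANNUITY_min', 'AMT_ANNUITY_max']
```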
For more details on unstacking groupby results and examples please see here
For more details and examples please see here
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
print(f"{appsDF[features].describe()}")
agg_ops = ["min", "max", "mean"]
result = appsDF.groupby(["SK_ID_CURR"], as_index=False).mean(numeric_only=True)  # group by ID; skip non-numeric columns
display(result.head())
print("-"*50)
result = appsDF.groupby(["SK_ID_CURR"], as_index=False).agg({'AMT_ANNUITY' : agg_ops, 'AMT_APPLICATION' : agg_ops})
result.columns = result.columns.map('_'.join)
display(result)
result['range_AMT_APPLICATION'] = result['AMT_APPLICATION_max'] - result['AMT_APPLICATION_min']
print(f"result.shape: {result.shape}")
result[0:10]
result.isna().sum()
# Create aggregate features (via pipeline)
class prevAppsFeaturesAggregater(BaseEstimator, TransformerMixin):
    def __init__(self, features=None):  # no *args or **kargs
        self.features = features
        # use a list (not a set) of named ops so the column order is deterministic
        self.agg_op_features = {f: ["min", "max", "mean"] for f in features}

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        result = X.groupby(["SK_ID_CURR"]).agg(self.agg_op_features)
        # flatten the MultiIndex columns: ('AMT_ANNUITY', 'min') -> 'AMT_ANNUITY_min'
        result.columns = [f"{f}_{op}" for f, op in result.columns]
        result = result.reset_index(level=["SK_ID_CURR"])
        result['range_AMT_APPLICATION'] = result['AMT_APPLICATION_max'] - result['AMT_APPLICATION_min']
        return result
from sklearn.pipeline import make_pipeline
def test_driver_prevAppsFeaturesAggregater(df, features):
print(f"df.shape: {df.shape}\n")
print(f"df[{features}][0:5]: \n{df[features][0:5]}")
test_pipeline = make_pipeline(prevAppsFeaturesAggregater(features))
return(test_pipeline.fit_transform(df))
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
features = ['AMT_ANNUITY',
'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
'CNT_PAYMENT',
'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
'DAYS_LAST_DUE', 'DAYS_TERMINATION']
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
res = test_driver_prevAppsFeaturesAggregater(appsDF, features)
print(f"Test driver: \n{res[0:10]}")
print(f"input[features][0:10]: \n{appsDF[0:10]}")
# QUESTION, should we lower case df['OCCUPATION_TYPE'] as Sales staff != 'Sales Staff'? (hint: YES)
res.head()
~3 == 3  # note: ~ binds tighter than ==, so this is (~3) == 3, i.e. False
datasets.keys()
## Transform all secondary tables
class feature_Aggregater(BaseEstimator, TransformerMixin):
    def __init__(self, features=None, sk_id=None):
        self.features = features
        self.sk_id = sk_id
        # use a list (not a set) of named ops so the column order is deterministic
        self.agg_op_features = {f: ["min", "max", "mean"] for f in features}

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        result = X.groupby([self.sk_id]).agg(self.agg_op_features)
        # flatten the MultiIndex columns: (feature, op) -> 'feature_op'
        result.columns = [f"{f}_{op}" for f, op in result.columns]
        result = result.reset_index(level=[self.sk_id])
        return result
## Previous App
prevAppsfeatures = ['AMT_ANNUITY', 'AMT_APPLICATION','AMT_CREDIT']
prevApps_feature_pipeline = Pipeline([('prevApps_aggregater', feature_Aggregater(prevAppsfeatures, 'SK_ID_CURR'))])
prevAppsDF = datasets['previous_application']
## POS_CASH_balance
POS_CASH_balance_features = ['MONTHS_BALANCE','CNT_INSTALMENT','CNT_INSTALMENT_FUTURE']
POS_CASH_balance_pipeline = Pipeline([('POS_CASH_balance_aggregater', feature_Aggregater(POS_CASH_balance_features,'SK_ID_CURR'))])
POS_CASH_balanceDF = datasets['POS_CASH_balance']
POS_CASH_balanceDF.dropna(inplace=True)
## installments_payments
installments_payments_features = ['DAYS_INSTALMENT','AMT_INSTALMENT']
installments_payments_pipeline = Pipeline([('installments_payments_aggregater', feature_Aggregater(installments_payments_features,'SK_ID_CURR'))])
installments_paymentsDF = datasets['installments_payments']
installments_paymentsDF.dropna(inplace=True)
## credit_card_balance
credit_card_balance_features = ['AMT_BALANCE','AMT_DRAWINGS_CURRENT']
credit_card_balance_pipeline = Pipeline([('credit_card_balance_aggregater', feature_Aggregater(credit_card_balance_features,'SK_ID_CURR'))])
credit_card_balanceDF = datasets['credit_card_balance']
credit_card_balanceDF.dropna(inplace=True)
## bureau_balance
bureau_balance_features = ['MONTHS_BALANCE']
bureau_balance_pipeline = Pipeline([('bureau_balance_aggregater', feature_Aggregater(bureau_balance_features,'SK_ID_BUREAU'))])
bureau_balanceDF = datasets['bureau_balance']
bureau_balanceDF.dropna(inplace=True)
## bureau
bureau_features = ['AMT_CREDIT_SUM']
bureau_pipeline = Pipeline([('bureau_aggregater', feature_Aggregater(bureau_features,'SK_ID_CURR'))])
bureauDF = datasets['bureau']
bureauDF.dropna(inplace=True)
X_train= datasets["application_train"] #primary dataset
print(X_train.shape)
appsDF = datasets["previous_application"] #prev app
merge_all_data = True
if merge_all_data:
    prevApps_aggregated = prevApps_feature_pipeline.transform(appsDF)
    POS_CASH_balance_aggregated = POS_CASH_balance_pipeline.transform(POS_CASH_balanceDF)
    installments_payments_aggregated = installments_payments_pipeline.transform(installments_paymentsDF)
    print(installments_payments_aggregated.columns)
    credit_card_balance_aggregated = credit_card_balance_pipeline.transform(credit_card_balanceDF)
    bureau_balance_aggregated = bureau_balance_pipeline.transform(bureau_balanceDF)
    bureau_aggregated = bureau_pipeline.transform(bureauDF)
    bureauDF = bureauDF.merge(bureau_balance_aggregated, how='left', on="SK_ID_BUREAU")
    bureauDF = bureauDF.merge(bureau_aggregated, how='left', on="SK_ID_CURR")
    X_train = X_train.merge(prevApps_aggregated, how='left', on="SK_ID_CURR")
    X_train = X_train.merge(POS_CASH_balance_aggregated, how='left', on="SK_ID_CURR")
    X_train = X_train.merge(installments_payments_aggregated, how='left', on="SK_ID_CURR")
    X_train = X_train.merge(credit_card_balance_aggregated, how='left', on="SK_ID_CURR")
    X_train = X_train.merge(bureauDF, how='left', on="SK_ID_CURR")
X_train.shape
# the with-statement closes the file automatically, so no explicit close is needed
with open("X_train_before", "w") as f:
    X_train.to_csv(f, index=False)
import pandas as pd
csv_file = 'X_train_before'
# Read the CSV file into a pandas DataFrame
#X_train = pd.read_csv(csv_file)
X_train.shape
prevAppCounts = appsDF['SK_ID_CURR'].value_counts(dropna=False)
prevAppCounts_df = prevAppCounts.to_frame()
# Frequency feature
prevAppCounts_df.columns = ['Count']
prevAppCounts_df['SK_ID_CURR'] = prevAppCounts.index
# Monetary feature
prevAppCounts_df['AverageAppAmt'] = appsDF.groupby(['SK_ID_CURR'])['AMT_APPLICATION'].mean()
# Recency feature
# Series.mean(level=...) was removed in pandas 2.0; group the per-client gaps explicitly
prevAppCounts_df['AVG_DAYS_BETWEEN_PAYMENTS'] = (
    installments_paymentsDF.groupby('SK_ID_CURR')['DAYS_INSTALMENT'].diff()
    .groupby(installments_paymentsDF['SK_ID_CURR']).mean()
)
prevAppCounts_df['AVG_DAYS_BETWEEN_PAYMENTS'] = prevAppCounts_df['AVG_DAYS_BETWEEN_PAYMENTS'].fillna(0)
X_train = X_train.merge(prevAppCounts_df,how='left',on='SK_ID_CURR')
X_train.shape
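A hedged sketch of the recency feature on a made-up instalment table: diff the scheduled instalment days within each client, then average the gaps per client (the first row of each client yields NaN and is skipped by the mean).

```python
import pandas as pd

# Illustrative instalment schedule for two hypothetical clients
pay = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2, 2],
    "DAYS_INSTALMENT": [-90, -60, -30, -45, -15],
})
gaps = pay.groupby("SK_ID_CURR")["DAYS_INSTALMENT"].diff()  # NaN at each client's first row
avg_gap = gaps.groupby(pay["SK_ID_CURR"]).mean()
print(avg_gap.to_dict())  # {1: 30.0, 2: 30.0}
```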
# Drop columns with more than 50% missing values
# threshold = 0.9
# null_counts = X_train.isnull().sum() / len(df)
# drop_cols = null_counts[null_counts > threshold].index
# X_train = X_train.drop(drop_cols, axis=1)
def drop_missing_cols(df, drop_percentage):
    percent = (df.isnull().sum() / df.isnull().count() * 100).sort_values(ascending=False).round(2)
    missing_cols = percent[percent > drop_percentage].index.tolist()
    df.drop(columns=missing_cols, inplace=True)
    print(missing_cols)
    return df
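A toy check of the threshold logic inside drop_missing_cols (made-up two-column frame): column 'b' is 100% null and exceeds a 90% cutoff, so only 'a' survives.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [np.nan, np.nan, np.nan]})
percent = (toy.isnull().sum() / toy.isnull().count() * 100).round(2)
dropped = percent[percent > 90].index.tolist()
toy = toy.drop(columns=dropped)
print(dropped, toy.columns.tolist())  # ['b'] ['a']
```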
X_train = drop_missing_cols(X_train, 90)
X_train.shape
# dropping one of the highly correlated features:
# Create a correlation matrix
import numpy as np
def drop_highly_correlated_features(df, corr_coeff):
    corr_matrix = df.corr(numeric_only=True).abs()  # numeric_only avoids errors on object columns in pandas >= 2
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
    # pairs of highly correlated features (correlation coefficient >= corr_coeff)
    corr_pairs = [(i, j)
                  for i in range(len(corr_matrix.columns))
                  for j in range(i + 1, len(corr_matrix.columns))
                  if mask[i, j] and corr_matrix.iloc[i, j] >= corr_coeff]
    for pair in corr_pairs:
        print(f"Highly correlated features: {corr_matrix.columns[pair[0]]}, {corr_matrix.columns[pair[1]]}")
        print(f"Correlation coefficient: {corr_matrix.iloc[pair[0], pair[1]]}\n")
    # From each highly correlated pair, keep the feature more correlated with TARGET and drop the other
    for pair in corr_pairs:
        feature1 = corr_matrix.columns[pair[0]]
        feature2 = corr_matrix.columns[pair[1]]
        if feature1 in df.columns and feature2 in df.columns:
            if df[feature1].dtype != 'object' and df[feature2].dtype != 'object':  # make sure both features are numeric
                if abs(df[feature1].corr(df['TARGET'])) >= abs(df[feature2].corr(df['TARGET'])):
                    df.drop(columns=[feature2], inplace=True)
                else:
                    df.drop(columns=[feature1], inplace=True)
    return df
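A toy illustration of the pair-pruning idea on synthetic data (all names and values here are made up): f1 and f2 are near-duplicates, so the pair crosses the threshold, and one reasonable policy is to keep whichever feature tracks TARGET more closely.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
df_demo = pd.DataFrame({
    "f1": base + rng.normal(scale=0.01, size=500),  # near-duplicate of f2
    "f2": base,
    "TARGET": (base > 0).astype(int),
})
pair_corr = df_demo[["f1", "f2"]].corr().abs().iloc[0, 1]
print(pair_corr >= 0.995)  # the pair is flagged at the 0.995 threshold
# keep the feature with the stronger |correlation| with TARGET, drop the other
keep = max(["f1", "f2"], key=lambda c: abs(df_demo[c].corr(df_demo["TARGET"])))
```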
X_train = drop_highly_correlated_features(X_train, 0.995)
X_train.shape
#X_train.to_csv("X_train_after",index=False)
# the with-statement closes the file automatically, so no explicit close is needed
with open("X_train_final", "w") as f:
    X_train.to_csv(f, index=False)
import pandas as pd
csv_file = 'X_train_final'
# Read the CSV file into a pandas DataFrame
X_train = pd.read_csv(csv_file)
X_train.shape
df=X_train
num_duplicates = X_train.duplicated().sum()
# Print the number of duplicate rows
print("Number of duplicate rows:", num_duplicates)
X_kaggle_test= datasets["application_test"]
if merge_all_data:
    # 1. Join/Merge in prevApps Data
    X_kaggle_test = X_kaggle_test.merge(prevApps_aggregated, how='left', on='SK_ID_CURR')
    # Since the prevApps data has already been merged with the other secondary tables, we only merge the prevApps aggregate with X_kaggle_test
# Convert categorical features to numerical approximations (via pipeline)
class ClaimAttributesAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        charlson_idx_dt = {'0': 0, '1-2': 2, '3-4': 4, '5+': 6}
        los_dt = {'1 day': 1, '2 days': 2, '3 days': 3, '4 days': 4, '5 days': 5, '6 days': 6,
                  '1- 2 weeks': 11, '2- 4 weeks': 21, '4- 8 weeks': 42, '26+ weeks': 180}
        X['PayDelay'] = X['PayDelay'].apply(lambda x: int(x) if x != '162+' else 162)
        X['DSFS'] = X['DSFS'].apply(lambda x: None if pd.isnull(x) else int(x[0]) + 1)
        X['CharlsonIndex'] = X['CharlsonIndex'].apply(lambda x: charlson_idx_dt[x])
        X['LengthOfStay'] = X['LengthOfStay'].apply(lambda x: None if pd.isnull(x) else los_dt[x])
        return X
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import mean_squared_error,f1_score, roc_auc_score
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
#logistic regssion import statement
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from time import time
Splitting the data
X_train.columns
#df = datasets["application_train"]
df = X_train
#input_f = datasets["application_train"]
X = df.drop('TARGET', axis=1) # Features
y = df['TARGET'] # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42) # Split data into train and test sets
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,
test_size=0.2, random_state=42)
print(f"X train shape: {X_train.shape}")
print(f"X validation shape: {X_valid.shape}")
print(f"X test shape: {X_test.shape}")
#X_train_resampled, y_train_resampled = resample(X_train[y_train == 0], y_train[y_train == 0], n_samples=len(X_train[y_train == 1]), random_state=42)
#X_train = pd.concat([X_train[y_train == 1], X_train_resampled])
#y_train = pd.concat([y_train[y_train == 1], y_train_resampled])
Numerical Features
Identify the numeric features we wish to consider
#getting the numerical features from the application train table which contains int and float datatypes.
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
numerical_features.remove("TARGET")
numerical_features
len(numerical_features)
Categorical Features
Identify the categorical features we wish to consider
#getting the categorical features from the application train table which contains object datatypes.
categorical_features = df.select_dtypes(include=['object']).columns.tolist()
categorical_features
len(categorical_features)
from sklearn.base import BaseEstimator, TransformerMixin

# Create a class to select numerical or categorical columns.
# (Older Scikit-Learn versions didn't handle DataFrames directly; newer
# ColumnTransformer can select columns by name, but the explicit selector
# keeps each sub-pipeline self-contained.)
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values
Numerical Pipeline Definition
# Create numerical pipeline
numerical_pipeline = Pipeline([
('selector', DataFrameSelector(numerical_features)),
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
Categorical Pipeline Definition
One-hot encoding when the test/validation set contains previously unseen category values
Train, validation and test sets (and the leakage problem we have mentioned previously):
Let's look at a small use case that shows how to deal with this:
Transforming the test set raises a ValueError because it contains new, previously unseen category values that the encoder does not know how to handle. To use both the transformed training and test sets in machine learning algorithms, they need to have the same number of columns. This problem can be solved with the OneHotEncoder option handle_unknown='ignore', which, as the name suggests, ignores previously unseen values when transforming the test set.
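A small self-contained use case (made-up category values): 'Student' appears only in the test split, so a strict encoder raises, while handle_unknown='ignore' encodes the unseen value as all zeros and keeps the column counts aligned.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_cats = np.array([["Working"], ["Pensioner"]])
test_cats = np.array([["Working"], ["Student"]])  # 'Student' was never seen in training

strict = OneHotEncoder().fit(train_cats)
try:
    strict.transform(test_cats)
except ValueError as e:
    print("strict encoder:", type(e).__name__)  # unseen category -> ValueError

lenient = OneHotEncoder(handle_unknown="ignore").fit(train_cats)
encoded = lenient.transform(test_cats).toarray()
print(encoded.shape, encoded[1].sum())  # (2, 2) 0.0 -> the unseen row is all zeros
```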
# Create categorical pipeline
categorical_pipeline = Pipeline([
('selector', DataFrameSelector(categorical_features)),
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
])
Data Preparation Pipeline
With ColumnTransformer we combine the numerical and categorical feature pipelines into a single data-preparation pipeline
# Create column transformer
column_transformer_pipeline = ColumnTransformer(
transformers=[
('num_pipeline', numerical_pipeline),
('cat_pipeline', categorical_pipeline)
], remainder="passthrough")
from sklearn.pipeline import FeatureUnion
data_prep_pipeline_FU = FeatureUnion(transformer_list=[
("num_pipeline", numerical_pipeline),
("cat_pipeline", categorical_pipeline),
])
The number of selected features:
selected_features = numerical_features + categorical_features
tot_features = f"{len(selected_features)}: Num:{len(numerical_features)}, Cat:{len(categorical_features)}"
#Total Feature selected for processing
tot_features
# Set feature selection settings
# Features removed each step
feature_selection_steps=10
# Number of features used
features_used=len(selected_features)
To get a baseline, we preprocess a subset of the features through the pipeline and fit a logistic regression model as the baseline.
def pct(x):
    return round(100 * x, 3)

try:
    experimentLog
except NameError:
    experimentLog = pd.DataFrame(columns=["exp_name",
                                          "Train Acc", "Valid Acc", "Test Acc",
                                          "Train AUC", "Valid AUC", "Test AUC",
                                          "Train F1 Score", "Valid F1 Score", "Test F1 Score",
                                          "Train Log Loss", "Valid Log Loss", "Test Log Loss",
                                          "Train Time", "Valid Time", "Test Time",
                                          "Description"])
Defining Pipeline
Logistic Regression
%%time
np.random.seed(42)
logistic_pipeline = Pipeline([
("preparation", column_transformer_pipeline),# combination of numerical, categorical subpipelines
#("clf", MultinomialNB()) # classifier estimator you are using
("logistic_regression", LogisticRegression() )
])
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import RFE
classifiers = [ [('Logistic Regression', LogisticRegression(solver='saga', random_state=42), "RFE")],
[('Support Vector Machine', SVC(random_state=42, probability=True),"SVM")],
[('Decision Tree', DecisionTreeClassifier(random_state=42), "RFE")],
[('Random Forest', RandomForestClassifier(random_state=42), "RFE")],
[('Gradient Boosting', GradientBoostingClassifier(warm_start=True, random_state=42), "RFE")]
]
Perform cross-validation and train the model. The training data is shuffled into 5 splits (ShuffleSplit with n_splits=5) for cross-validation.
from sklearn.model_selection import ShuffleSplit, cross_validate
cvSplits = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
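A hedged sketch on synthetic data (make_classification here stands in for the real features): a ShuffleSplit object like cvSplits above can be passed directly to cross_validate, which fits and scores the estimator once per shuffled split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

X_demo, y_demo = make_classification(n_samples=300, random_state=0)
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
scores = cross_validate(LogisticRegression(max_iter=1000), X_demo, y_demo,
                        cv=cv, scoring="roc_auc")
print(len(scores["test_score"]))  # one AUC per split -> 5
```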
start = time()
logistic_pipeline.fit(X_train, y_train)
duration_LR_train = np.round((time() - start), 4)
np.random.seed(42)
# Time and score for valid predictions
start = time()
logit_score_valid = logistic_pipeline.score(X_valid, y_valid)
valid_time = np.round(time() - start, 4)
# Time and score for test predictions
start = time()
logit_score_test = logistic_pipeline.score(X_test, y_test)
test_time = np.round(time() - start, 4)
# Time and score for train predictions
start = time()
logit_score_train = logistic_pipeline.score(X_train, y_train)
train_time = np.round(time() - start, 4)
from sklearn.metrics import roc_auc_score
from sklearn.metrics import log_loss
# obtained predicted probabilities or scores for train
predicted_probs = logistic_pipeline.predict_proba(X_train)[:, 1]
true_labels = y_train
# Calculate the AUC for train
auc_train = roc_auc_score(true_labels, predicted_probs)
# Calculate the log loss for train
logloss_train = log_loss(true_labels, predicted_probs)
# obtained predicted probabilities or scores for test predictions
predicted_probs = logistic_pipeline.predict_proba(X_test)[:, 1]
true_labels = y_test
# Calculate the AUC for test
auc_test = roc_auc_score(true_labels, predicted_probs)
# Calculate the log loss for test
logloss_test = log_loss(true_labels, predicted_probs)
# obtained predicted probabilities or scores for valid predictions
predicted_probs = logistic_pipeline.predict_proba(X_valid)[:, 1]
true_labels = y_valid
# Calculate the AUC for test
auc_valid = roc_auc_score(true_labels, predicted_probs)
# Calculate the log loss for valid
logloss_valid = log_loss(true_labels, predicted_probs)
#F1-Score for test
y_test_pred = logistic_pipeline.predict(X_test)
f1_test = f1_score(y_test, y_test_pred)
print("F1-Score for Logistic Regression is: " , np.round(f1_test, 4))
#F1-Score for train
y_train_pred = logistic_pipeline.predict(X_train)
f1_train = f1_score(y_train, y_train_pred)
#F1-Score for valid
y_valid_pred = logistic_pipeline.predict(X_valid)
f1_valid = f1_score(y_valid, y_valid_pred)
exp_name = f"Baseline_{len(selected_features)}_features"
# Append to the experimentLog created above; re-creating the DataFrame here would wipe earlier rows.
# Values are listed in the same order as the experimentLog columns.
experimentLog.loc[len(experimentLog)] = [exp_name,
                                         logit_score_train, logit_score_valid, logit_score_test,
                                         auc_train, auc_valid, auc_test,
                                         f1_train, f1_valid, f1_test,
                                         logloss_train, logloss_valid, logloss_test,
                                         train_time, valid_time, test_time,
                                         f"Imbalanced Logistic reg features {tot_features} with 20% training data"]
#experimentLog.loc[len(experimentLog)] = ["Logistic Regression as Baseline", "HCDR", f"{f(accuracy_train_LR)}%", f"{f(test_accuracy_LR)}%", f"{f(accuracy_valid_LR)}%", f"{f(duration_LR_train)}%", test_time, (f1_LR), f"{f(accuracy_for_LR)}%", f"{f(roc_auc)}%","Logistic Regression with numerical and categorical pipeline "]
experimentLog
Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
The scikit-learn roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, also denoted AUC or AUROC. Computing the area under the ROC curve summarizes the curve's information in a single number.
>>> import numpy as np
>>> from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75
Confusion Matrix
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_test_pred)
# Create a heatmap for the confusion matrix
sns.heatmap(cm/np.sum(cm), annot=True, fmt=".3%", cmap="Blues")
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix for test')
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
# Calculate the confusion matrix
cm = confusion_matrix(y_train, y_train_pred)
# Create a heatmap for the confusion matrix
sns.heatmap(cm/np.sum(cm), annot=True, fmt=".3%", cmap="Blues")
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix for train')
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
# Calculate the confusion matrix
cm = confusion_matrix(y_valid, y_valid_pred)
# Create a heatmap for the confusion matrix
sns.heatmap(cm/np.sum(cm), annot=True, fmt=".3%", cmap="Blues")
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix for valid')
plt.show()
ROC CURVE
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
# Obtain predicted probabilities for the positive class for each dataset
train_predicted_probs = logistic_pipeline.predict_proba(X_train)[:, 1]
test_predicted_probs = logistic_pipeline.predict_proba(X_test)[:, 1]
valid_predicted_probs = logistic_pipeline.predict_proba(X_valid)[:, 1]
# Assuming y_train, y_test, y_valid contain the true labels for your train, test, and validation datasets respectively
train_true_labels = y_train
test_true_labels = y_test
valid_true_labels = y_valid
# Calculate the false positive rate, true positive rate, and thresholds for each dataset
train_fpr, train_tpr, _ = roc_curve(train_true_labels, train_predicted_probs)
test_fpr, test_tpr, _ = roc_curve(test_true_labels, test_predicted_probs)
valid_fpr, valid_tpr, _ = roc_curve(valid_true_labels, valid_predicted_probs)
# Calculate the area under the ROC curve for each dataset
train_roc_auc = auc(train_fpr, train_tpr)
test_roc_auc = auc(test_fpr, test_tpr)
valid_roc_auc = auc(valid_fpr, valid_tpr)
# Plot the ROC curves for train, test, and validation datasets on a single plot
plt.figure(figsize=(10, 6))
plt.plot(train_fpr, train_tpr, label='Train ROC Curve (AUC = {:.2f})'.format(train_roc_auc))
plt.plot(test_fpr, test_tpr, label='Test ROC Curve (AUC = {:.2f})'.format(test_roc_auc))
plt.plot(valid_fpr, valid_tpr, label='Validation ROC Curve (AUC = {:.2f})'.format(valid_roc_auc))
plt.plot([0, 1], [0, 1], 'r--', label='Random', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic for Train, Test, and Validation')
plt.legend(loc='lower right')
plt.show()
Decision Tree
from sklearn.tree import DecisionTreeClassifier
%%time
np.random.seed(42)
decision_tree_pipeline = Pipeline([
("preparation", column_transformer_pipeline),
("dt",DecisionTreeClassifier(max_depth=3))
])
model_DT = decision_tree_pipeline.fit(X_train, y_train)
start = time()
decision_tree_pipeline.fit(X_train, y_train)
duration_dt_train = np.round((time() - start), 4)
np.random.seed(42)
# Time and score for valid predictions
start = time()
# Predict on the valid data
y_pred_valid = model_DT.predict(X_valid)
# Calculate accuracy on the valid data
logit_score_valid = accuracy_score(y_valid, y_pred_valid)
valid_time = np.round(time() - start, 4)
print(valid_time)
# Time and score for test predictions
start = time()
# Predict on the testing data
y_pred_test = model_DT.predict(X_test)
# Calculate accuracy on the testing data
logit_score_test = accuracy_score(y_test, y_pred_test)
test_time = np.round(time() - start, 4)
print(test_time)
# Time and score for train predictions
start = time()
# Predict on the training data
y_pred_train = model_DT.predict(X_train)
# Calculate accuracy on the training data
logit_score_train = accuracy_score(y_train, y_pred_train)
train_time = np.round(time() - start, 4)
print(train_time)
from sklearn.metrics import roc_auc_score
from sklearn.metrics import log_loss
# obtained predicted probabilities or scores for train
predicted_probs = model_DT.predict_proba(X_train)[:, 1]
true_labels = y_train
# Calculate the AUC for train
auc_train = roc_auc_score(true_labels, predicted_probs)
# Calculate the log loss for train
logloss_train = log_loss(true_labels, predicted_probs)
# obtained predicted probabilities or scores for test predictions
predicted_probs = model_DT.predict_proba(X_test)[:, 1]
true_labels = y_test
# Calculate the AUC for test
auc_test = roc_auc_score(true_labels, predicted_probs)
# Calculate the log loss for test
logloss_test = log_loss(true_labels, predicted_probs)
# obtained predicted probabilities or scores for valid predictions
predicted_probs = model_DT.predict_proba(X_valid)[:, 1]
true_labels = y_valid
# Calculate the AUC for valid
auc_valid = roc_auc_score(true_labels, predicted_probs)
# Calculate the log loss for valid
logloss_valid = log_loss(true_labels, predicted_probs)
#F1-Score for test
#y_test_pred = model_DT.predict(X_test)
f1_test = f1_score(y_test, y_pred_test)
print("F1-Score for DT is: " , np.round(f1_test, 4))
#F1-Score for train
#y_train_pred = model_DT.predict(X_train)
f1_train = f1_score(y_train, y_pred_train)
#F1-Score for valid
#y_valid_pred = model_DT.predict(X_valid)
f1_valid = f1_score(y_valid, y_pred_valid)
exp_name = f"DecisionTree_{len(selected_features)}_features"
# Append to the existing experimentLog (values in column order) rather than re-creating it
experimentLog.loc[len(experimentLog)] = [exp_name,
                                         logit_score_train, logit_score_valid, logit_score_test,
                                         auc_train, auc_valid, auc_test,
                                         f1_train, f1_valid, f1_test,
                                         logloss_train, logloss_valid, logloss_test,
                                         train_time, valid_time, test_time,
                                         f"Imbalanced Decision Tree features {tot_features}"]
experimentLog
Metrics For Decision Tree
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred_test)
# Create a heatmap for the confusion matrix
sns.heatmap(cm/np.sum(cm), annot=True, fmt=".3%", cmap="Blues")
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix for test')
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
# Calculate the confusion matrix
cm = confusion_matrix(y_valid, y_pred_valid)
# Create a heatmap for the confusion matrix
sns.heatmap(cm/np.sum(cm), annot=True, fmt=".3%", cmap="Blues")
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix for Valid')
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
# Calculate the confusion matrix
cm = confusion_matrix(y_train, y_pred_train)
# Create a heatmap for the confusion matrix
sns.heatmap(cm/np.sum(cm), annot=True, fmt=".3%", cmap="Blues")
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix for train')
plt.show()
ROC CURVE
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
# Obtain predicted probabilities for the positive class for each dataset
train_predicted_probs = model_DT.predict_proba(X_train)[:, 1]
test_predicted_probs = model_DT.predict_proba(X_test)[:, 1]
valid_predicted_probs = model_DT.predict_proba(X_valid)[:, 1]
# Assuming y_train, y_test, y_valid contain the true labels for your train, test, and validation datasets respectively
train_true_labels = y_train
test_true_labels = y_test
valid_true_labels = y_valid
# Calculate the false positive rate, true positive rate, and thresholds for each dataset
train_fpr, train_tpr, _ = roc_curve(train_true_labels, train_predicted_probs)
test_fpr, test_tpr, _ = roc_curve(test_true_labels, test_predicted_probs)
valid_fpr, valid_tpr, _ = roc_curve(valid_true_labels, valid_predicted_probs)
# Calculate the area under the ROC curve for each dataset
train_roc_auc = auc(train_fpr, train_tpr)
test_roc_auc = auc(test_fpr, test_tpr)
valid_roc_auc = auc(valid_fpr, valid_tpr)
# Plot the ROC curves for train, test, and validation datasets on a single plot
plt.figure(figsize=(10, 6))
plt.plot(train_fpr, train_tpr, label='Train ROC Curve (AUC = {:.2f})'.format(train_roc_auc))
plt.plot(test_fpr, test_tpr, label='Test ROC Curve (AUC = {:.2f})'.format(test_roc_auc))
plt.plot(valid_fpr, valid_tpr, label='Validation ROC Curve (AUC = {:.2f})'.format(valid_roc_auc))
plt.plot([0, 1], [0, 1], 'r--', label='Random', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic for Train, Test, and Validation')
plt.legend(loc='lower right')
plt.show()
Random Forest PIPELINE
from sklearn.ensemble import RandomForestClassifier
rf_pipeline = Pipeline([
("preparation", column_transformer_pipeline),
("classifier",RandomForestClassifier(max_depth=2))
])
start = time()
model_RF = rf_pipeline.fit(X_train, y_train)
duration_dt_train = np.round((time() - start), 4)
# Time and score for valid predictions
start = time()
# Predict on the valid data
y_pred_valid = model_RF.predict(X_valid)
# Calculate accuracy on the valid data
logit_score_valid = accuracy_score(y_valid, y_pred_valid)
valid_time = np.round(time() - start, 4)
print(valid_time)
# Time and score for test predictions
start = time()
# Predict on the testing data
y_pred_test = model_RF.predict(X_test)
# Calculate accuracy on the testing data
logit_score_test = accuracy_score(y_test, y_pred_test)
test_time = np.round(time() - start, 4)
print(test_time)
# Time and score for train predictions
start = time()
# Predict on the training data
y_pred_train = model_RF.predict(X_train)
# Calculate accuracy on the training data
logit_score_train = accuracy_score(y_train, y_pred_train)
train_time = np.round(time() - start, 4)
print(train_time)
from sklearn.metrics import roc_auc_score
from sklearn.metrics import log_loss
# obtained predicted probabilities or scores for train
predicted_probs = model_RF.predict_proba(X_train)[:, 1]
true_labels = y_train
# Calculate the AUC for train
auc_train = roc_auc_score(true_labels, predicted_probs)
# Calculate the log loss for train
logloss_train = log_loss(true_labels, predicted_probs)
# obtained predicted probabilities or scores for test predictions
predicted_probs = model_RF.predict_proba(X_test)[:, 1]
true_labels = y_test
# Calculate the AUC for test
auc_test = roc_auc_score(true_labels, predicted_probs)
# Calculate the log loss for test
logloss_test = log_loss(true_labels, predicted_probs)
# obtained predicted probabilities or scores for valid predictions
predicted_probs = model_RF.predict_proba(X_valid)[:, 1]
true_labels = y_valid
# Calculate the AUC for valid
auc_valid = roc_auc_score(true_labels, predicted_probs)
# Calculate the log loss for valid
logloss_valid = log_loss(true_labels, predicted_probs)
# F1-Score for test
f1_test = f1_score(y_test, y_pred_test)
print("F1-Score for Random Forest is: ", np.round(f1_test, 4))
#F1-Score for train
#y_train_pred = model_DT.predict(X_train)
f1_train = f1_score(y_train, y_pred_train)
#F1-Score for valid
#y_valid_pred = model_DT.predict(X_valid)
f1_valid = f1_score(y_valid, y_pred_valid)
exp_name = f"RandomForest_{len(selected_features)}_features"
# Append to the existing experimentLog (values in column order) rather than re-creating it
experimentLog.loc[len(experimentLog)] = [exp_name,
                                         logit_score_train, logit_score_valid, logit_score_test,
                                         auc_train, auc_valid, auc_test,
                                         f1_train, f1_valid, f1_test,
                                         logloss_train, logloss_valid, logloss_test,
                                         train_time, valid_time, test_time,
                                         f"Imbalanced Random Forest features {tot_features}"]
experimentLog
Metrics for Random Forest
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred_test)
# Create a heatmap for the confusion matrix
sns.heatmap(cm/np.sum(cm), annot=True, fmt=".3%", cmap="Blues")
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix for test')
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
# Calculate the confusion matrix
cm = confusion_matrix(y_train, y_pred_train)
# Create a heatmap for the confusion matrix
sns.heatmap(cm/np.sum(cm), annot=True, fmt=".3%", cmap="Blues")
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix for train')
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
# Calculate the confusion matrix
cm = confusion_matrix(y_valid, y_pred_valid)
# Create a heatmap for the confusion matrix
sns.heatmap(cm/np.sum(cm), annot=True, fmt=".3%", cmap="Blues")
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix for valid')
plt.show()
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
# Obtain predicted probabilities for the positive class for each dataset
train_predicted_probs = model_RF.predict_proba(X_train)[:, 1]
test_predicted_probs = model_RF.predict_proba(X_test)[:, 1]
valid_predicted_probs = model_RF.predict_proba(X_valid)[:, 1]
# Assuming y_train, y_test, y_valid contain the true labels for your train, test, and validation datasets respectively
train_true_labels = y_train
test_true_labels = y_test
valid_true_labels = y_valid
# Calculate the false positive rate, true positive rate, and thresholds for each dataset
train_fpr, train_tpr, _ = roc_curve(train_true_labels, train_predicted_probs)
test_fpr, test_tpr, _ = roc_curve(test_true_labels, test_predicted_probs)
valid_fpr, valid_tpr, _ = roc_curve(valid_true_labels, valid_predicted_probs)
# Calculate the area under the ROC curve for each dataset
train_roc_auc = auc(train_fpr, train_tpr)
test_roc_auc = auc(test_fpr, test_tpr)
valid_roc_auc = auc(valid_fpr, valid_tpr)
# Plot the ROC curves for train, test, and validation datasets on a single plot
plt.figure(figsize=(10, 6))
plt.plot(train_fpr, train_tpr, label='Train ROC Curve (AUC = {:.2f})'.format(train_roc_auc))
plt.plot(test_fpr, test_tpr, label='Test ROC Curve (AUC = {:.2f})'.format(test_roc_auc))
plt.plot(valid_fpr, valid_tpr, label='Validation ROC Curve (AUC = {:.2f})'.format(valid_roc_auc))
plt.plot([0, 1], [0, 1], 'r--', label='Random', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic for Train, Test, and Validation for Random Forest')
plt.legend(loc='lower right')
plt.show()
Experimenting with different feature sets
# Split the provided training data into training and validationa and test
# The kaggle evaluation test set has no labels
#
from sklearn.model_selection import train_test_split
use_application_data_ONLY = True  # True: use application_train features only (no joined tables)
if use_application_data_ONLY:
    # just a few selected features for a baseline experiment
    selected_features = ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'DAYS_EMPLOYED', 'DAYS_BIRTH', 'EXT_SOURCE_1',
                         'EXT_SOURCE_2', 'EXT_SOURCE_3', 'CODE_GENDER', 'FLAG_OWN_REALTY', 'FLAG_OWN_CAR',
                         'NAME_CONTRACT_TYPE', 'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'NAME_INCOME_TYPE']
    X_train = datasets["application_train"][selected_features]
    y_train = datasets["application_train"]['TARGET']
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.15, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.15, random_state=42)
    X_kaggle_test = datasets["application_test"][selected_features]
    # y_test = datasets["application_test"]['TARGET'] #why no TARGET?!! (hint: kaggle competition)
X_train = datasets["application_train"]
selected_features = ['AMT_INCOME_TOTAL', 'AMT_CREDIT','DAYS_EMPLOYED','DAYS_BIRTH','EXT_SOURCE_1',
'EXT_SOURCE_2','EXT_SOURCE_3','CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE',
'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']
y_train = X_train['TARGET']
X_train = X_train[selected_features]
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.15, random_state=42)
X_kaggle_test= X_kaggle_test[selected_features]
# y_test = datasets["application_test"]['TARGET'] #why no TARGET?!! (hint: kaggle competition)
print(f"X train shape: {X_train.shape}")
print(f"X validation shape: {X_valid.shape}")
print(f"X test shape: {X_test.shape}")
print(f"X X_kaggle_test shape: {X_kaggle_test.shape}")
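As a sanity check on the chained splits above: applying `test_size=0.15` twice yields roughly 72.25% train, 15% validation, and 12.75% test. A minimal sketch on a dummy 10,000-row array (not the HCDR data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_dummy = np.arange(10_000).reshape(-1, 1)
y_dummy = np.zeros(10_000)

# First split peels off 15% for validation...
X_tr, X_va, y_tr, y_va = train_test_split(X_dummy, y_dummy, test_size=0.15, random_state=42)
# ...then 15% of the remaining 85% (12.75% overall) becomes the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_tr, y_tr, test_size=0.15, random_state=42)

print(len(X_tr), len(X_va), len(X_te))  # 7225 1500 1275
```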
# Create a class to select numerical or categorical columns from a DataFrame,
# returning them as a NumPy array for the downstream pipeline steps
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values
# Identify the numeric features we wish to consider.
num_attribs = [
'AMT_INCOME_TOTAL', 'AMT_CREDIT','DAYS_EMPLOYED','DAYS_BIRTH','EXT_SOURCE_1',
'EXT_SOURCE_2','EXT_SOURCE_3']
num_pipeline = Pipeline([
('selector', DataFrameSelector(num_attribs)),
('imputer', SimpleImputer(strategy='mean')),
('std_scaler', StandardScaler()),
])
# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE',
'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']
# Note handle_unknown="ignore" in the OneHotEncoder, which ignores categories in the
# validation/test data that do NOT occur in the training set
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    #('imputer', SimpleImputer(strategy='most_frequent')),
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))  # 'sparse' was renamed 'sparse_output' in scikit-learn 1.2
])
])
data_prep_pipeline = FeatureUnion(transformer_list=[
("num_pipeline", num_pipeline),
("cat_pipeline", cat_pipeline),
])
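To see what this kind of preparation pipeline produces end to end, here is a minimal, self-contained sketch on a toy frame (the toy column names are made up, not HCDR columns): the numeric branch imputes and scales, the categorical branch imputes and one-hot encodes, and `FeatureUnion` concatenates the results column-wise.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Pull named columns out of a DataFrame as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

toy = pd.DataFrame({
    "income": [100.0, np.nan, 300.0],   # one missing numeric value
    "gender": ["M", "F", np.nan],       # one missing category
})

union = FeatureUnion(transformer_list=[
    ("num", Pipeline([
        ("selector", DataFrameSelector(["income"])),
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ])),
    ("cat", Pipeline([
        ("selector", DataFrameSelector(["gender"])),
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ])),
])

out = union.fit_transform(toy)
print(out.shape)  # (3, 4): 1 scaled numeric column + 3 one-hot columns (F, M, missing)
```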
list(datasets["application_train"].columns)
selected_features = num_attribs + cat_attribs
tot_features = f"{len(selected_features)}: Num:{len(num_attribs)}, Cat:{len(cat_attribs)}"
# Total features selected for processing
tot_features
%%time
np.random.seed(42)
logistic_pipeline = Pipeline([
("preparation", data_prep_pipeline),# combination of numerical, categorical subpipelines
#("clf", MultinomialNB()) # classifier estimator you are using
("logistic_regression", LogisticRegression() )
])
start = time()
logistic_pipeline.fit(X_train, y_train)
duration_logistic_train = np.round((time() - start), 4)
np.random.seed(42)
# Time and score for valid predictions
start = time()
logit_score_valid = logistic_pipeline.score(X_valid, y_valid)
valid_time = np.round(time() - start, 4)
# Time and score for test predictions
start = time()
logit_score_test = logistic_pipeline.score(X_test, y_test)
test_time = np.round(time() - start, 4)
# Time and score for train predictions
start = time()
logit_score_train = logistic_pipeline.score(X_train, y_train)
train_time = np.round(time() - start, 4)
from sklearn.metrics import roc_auc_score
from sklearn.metrics import log_loss
# obtain predicted probabilities for the positive class on the train set
predicted_probs = logistic_pipeline.predict_proba(X_train)[:, 1]
true_labels = y_train
# Calculate the AUC for train
auc_train = roc_auc_score(true_labels, predicted_probs)
# Calculate the log loss for train
logloss_train = log_loss(true_labels, predicted_probs)
# obtain predicted probabilities for the positive class on the test set
predicted_probs = logistic_pipeline.predict_proba(X_test)[:, 1]
true_labels = y_test
# Calculate the AUC for test
auc_test = roc_auc_score(true_labels, predicted_probs)
# Calculate the log loss for test
logloss_test = log_loss(true_labels, predicted_probs)
# obtain predicted probabilities for the positive class on the validation set
predicted_probs = logistic_pipeline.predict_proba(X_valid)[:, 1]
true_labels = y_valid
# Calculate the AUC for valid
auc_valid = roc_auc_score(true_labels, predicted_probs)
# Calculate the log loss for valid
logloss_valid = log_loss(true_labels, predicted_probs)
#F1-Score for test
y_test_pred = logistic_pipeline.predict(X_test)
f1_test = f1_score(y_test, y_test_pred)
print("F1-Score for Logistic Regression is: " , np.round(f1_test, 4))
#F1-Score for train
y_train_pred = logistic_pipeline.predict(X_train)
f1_train = f1_score(y_train, y_train_pred)
#F1-Score for valid
y_valid_pred = logistic_pipeline.predict(X_valid)
f1_valid = f1_score(y_valid, y_valid_pred)
exp_name = f"Baseline_{len(selected_features)}_features"
experimentLog.loc[len(experimentLog)] = exp_name, logit_score_train, logit_score_valid, logit_score_test, auc_train, auc_test, auc_valid, f1_train, f1_valid, f1_test, logloss_test, logloss_train, logloss_valid, train_time, valid_time, test_time, f"Imbalanced Logistic Regression reg features {tot_features}"
#experimentLog.loc[len(experimentLog)] = ["Logistic Regression as Baseline", "HCDR", f"{f(accuracy_train_LR)}%", f"{f(test_accuracy_LR)}%", f"{f(accuracy_valid_LR)}%", f"{f(duration_LR_train)}%", test_time, (f1_LR), f"{f(accuracy_for_LR)}%", f"{f(roc_auc)}%","Logistic Regression with numerical and categorical pipeline "]
experimentLog
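The three near-identical metric blocks above (train/test/valid AUC, log loss, F1) could be folded into one small helper. This is only a refactoring sketch; the function name is illustrative:

```python
from sklearn.metrics import f1_score, log_loss, roc_auc_score

def score_split(model, X, y):
    """Return (AUC, log loss, F1) for one data split, given a fitted
    classifier or pipeline that exposes predict_proba."""
    probs = model.predict_proba(X)[:, 1]  # positive-class probabilities
    preds = model.predict(X)
    return roc_auc_score(y, probs), log_loss(y, probs), f1_score(y, preds)

# Usage against the pipelines above would look like, e.g.:
# auc_valid, logloss_valid, f1_valid = score_split(logistic_pipeline, X_valid, y_valid)
```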
Metrics for Logistic Regression (14 features)
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_test_pred)
# Create a heatmap for the confusion matrix
sns.heatmap(cm/np.sum(cm), annot=True, fmt=".3%", cmap="Blues")
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix for test')
plt.show()
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
# Obtain predicted probabilities for the positive class for each dataset
test_predicted_probs = logistic_pipeline.predict_proba(X_test)[:, 1]
test_true_labels = y_test
# Calculate the false positive rate, true positive rate, and thresholds for each dataset
test_fpr, test_tpr, _ = roc_curve(test_true_labels, test_predicted_probs)
# Calculate the area under the ROC curve for each dataset
test_roc_auc = auc(test_fpr, test_tpr)
# Plot the ROC curves for train, test, and validation datasets on a single plot
plt.figure(figsize=(10, 6))
plt.plot(test_fpr, test_tpr, label='Test ROC Curve (AUC = {:.2f})'.format(test_roc_auc))
plt.plot([0, 1], [0, 1], 'r--', label='Random', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic for Test For Logistic')
plt.legend(loc='lower right')
plt.show()
Decision Tree for 14 features
from sklearn.tree import DecisionTreeClassifier
%%time
np.random.seed(42)
decision_tree_pipeline = Pipeline([
("preparation", data_prep_pipeline),
("dt",DecisionTreeClassifier(max_depth=3))
])
start = time()
model_DT = decision_tree_pipeline.fit(X_train, y_train)
duration_dt_train = np.round((time() - start), 4)
np.random.seed(42)
# Time and score for valid predictions
start = time()
logit_score_valid = decision_tree_pipeline.score(X_valid, y_valid)
valid_time = np.round(time() - start, 4)
# Time and score for test predictions
start = time()
logit_score_test = decision_tree_pipeline.score(X_test, y_test)
test_time = np.round(time() - start, 4)
# Time and score for train predictions
start = time()
logit_score_train = decision_tree_pipeline.score(X_train, y_train)
train_time = np.round(time() - start, 4)
from sklearn.metrics import roc_auc_score
from sklearn.metrics import log_loss
# obtain predicted probabilities for the positive class on the train set
predicted_probs = decision_tree_pipeline.predict_proba(X_train)[:, 1]
true_labels = y_train
# Calculate the AUC for train
auc_train = roc_auc_score(true_labels, predicted_probs)
# Calculate the log loss for train
logloss_train = log_loss(true_labels, predicted_probs)
# obtain predicted probabilities for the positive class on the test set
predicted_probs = decision_tree_pipeline.predict_proba(X_test)[:, 1]
true_labels = y_test
# Calculate the AUC for test
auc_test = roc_auc_score(true_labels, predicted_probs)
# Calculate the log loss for test
logloss_test = log_loss(true_labels, predicted_probs)
# obtain predicted probabilities for the positive class on the validation set
predicted_probs = decision_tree_pipeline.predict_proba(X_valid)[:, 1]
true_labels = y_valid
# Calculate the AUC for valid
auc_valid = roc_auc_score(true_labels, predicted_probs)
# Calculate the log loss for valid
logloss_valid = log_loss(true_labels, predicted_probs)
#F1-Score for test
y_test_pred = decision_tree_pipeline.predict(X_test)
f1_test = f1_score(y_test, y_test_pred)
print("F1-Score for DT is: " , np.round(f1_test, 4))
#F1-Score for train
y_train_pred = decision_tree_pipeline.predict(X_train)
f1_train = f1_score(y_train, y_train_pred)
#F1-Score for valid
y_valid_pred = decision_tree_pipeline.predict(X_valid)
f1_valid = f1_score(y_valid, y_valid_pred)
exp_name = f"Baseline_{len(selected_features)}_features"
experimentLog.loc[len(experimentLog)] = exp_name, logit_score_train, logit_score_valid, logit_score_test, auc_train, auc_test, auc_valid, f1_train, f1_valid, f1_test, logloss_test, logloss_train, logloss_valid, train_time, valid_time, test_time, f"Imbalanced Decision Tree reg features {tot_features}"
experimentLog
Metrics for Decision Tree
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_test_pred)
# Create a heatmap for the confusion matrix
sns.heatmap(cm/np.sum(cm), annot=True, fmt=".3%", cmap="Blues")
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix for test')
plt.show()
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
# Obtain predicted probabilities for the positive class for each dataset
test_predicted_probs = decision_tree_pipeline.predict_proba(X_test)[:, 1]
test_true_labels = y_test
# Calculate the false positive rate, true positive rate, and thresholds for each dataset
test_fpr, test_tpr, _ = roc_curve(test_true_labels, test_predicted_probs)
# Calculate the area under the ROC curve for each dataset
test_roc_auc = auc(test_fpr, test_tpr)
# Plot the ROC curves for train, test, and validation datasets on a single plot
plt.figure(figsize=(10, 6))
plt.plot(test_fpr, test_tpr, label='Test ROC Curve (AUC = {:.2f})'.format(test_roc_auc))
plt.plot([0, 1], [0, 1], 'r--', label='Random', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic for Test For Decision Tree')
plt.legend(loc='lower right')
plt.show()
Random Forest for 14 features
from sklearn.ensemble import RandomForestClassifier
rf_pipeline = Pipeline([
("preparation", data_prep_pipeline),
("classifier",RandomForestClassifier(max_depth=2))
])
start = time()
model_RF = rf_pipeline.fit(X_train, y_train)
duration_rf_train = np.round((time() - start), 4)
np.random.seed(42)
# Time and score for valid predictions
start = time()
logit_score_valid = rf_pipeline.score(X_valid, y_valid)
valid_time = np.round(time() - start, 4)
# Time and score for test predictions
start = time()
logit_score_test = rf_pipeline.score(X_test, y_test)
test_time = np.round(time() - start, 4)
# Time and score for train predictions
start = time()
logit_score_train = rf_pipeline.score(X_train, y_train)
train_time = np.round(time() - start, 4)
from sklearn.metrics import roc_auc_score
from sklearn.metrics import log_loss
# obtain predicted probabilities for the positive class on the train set
predicted_probs = rf_pipeline.predict_proba(X_train)[:, 1]
true_labels = y_train
# Calculate the AUC for train
auc_train = roc_auc_score(true_labels, predicted_probs)
# Calculate the log loss for train
logloss_train = log_loss(true_labels, predicted_probs)
# obtain predicted probabilities for the positive class on the test set
predicted_probs = rf_pipeline.predict_proba(X_test)[:, 1]
true_labels = y_test
# Calculate the AUC for test
auc_test = roc_auc_score(true_labels, predicted_probs)
# Calculate the log loss for test
logloss_test = log_loss(true_labels, predicted_probs)
# obtain predicted probabilities for the positive class on the validation set
predicted_probs = rf_pipeline.predict_proba(X_valid)[:, 1]
true_labels = y_valid
# Calculate the AUC for valid
auc_valid = roc_auc_score(true_labels, predicted_probs)
# Calculate the log loss for valid
logloss_valid = log_loss(true_labels, predicted_probs)
#F1-Score for test
y_test_pred1 = rf_pipeline.predict(X_test)
f1_test = f1_score(y_test, y_test_pred1)
print("F1-Score for RF is: ", np.round(f1_test, 4))
#F1-Score for train
y_train_pred = rf_pipeline.predict(X_train)
f1_train = f1_score(y_train, y_train_pred)
#F1-Score for valid
y_valid_pred = rf_pipeline.predict(X_valid)
f1_valid = f1_score(y_valid, y_valid_pred)
exp_name = f"Baseline_{len(selected_features)}_features"
experimentLog.loc[len(experimentLog)] = exp_name, logit_score_train, logit_score_valid, logit_score_test, auc_train, auc_test, auc_valid, f1_train, f1_valid, f1_test, logloss_test, logloss_train, logloss_valid, train_time, valid_time, test_time, f"Imbalanced Random Forest reg features {tot_features}"
experimentLog
Metrics for Random Forest
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_test_pred1)
# Create a heatmap for the confusion matrix
sns.heatmap(cm/np.sum(cm), annot=True, fmt=".3%", cmap="Blues")
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix for test')
plt.show()
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
# Obtain predicted probabilities for the positive class for each dataset
test_predicted_probs = rf_pipeline.predict_proba(X_test)[:, 1]
test_true_labels = y_test
# Calculate the false positive rate, true positive rate, and thresholds for each dataset
test_fpr, test_tpr, _ = roc_curve(test_true_labels, test_predicted_probs)
# Calculate the area under the ROC curve for each dataset
test_roc_auc = auc(test_fpr, test_tpr)
# Plot the ROC curves for train, test, and validation datasets on a single plot
plt.figure(figsize=(10, 6))
plt.plot(test_fpr, test_tpr, label='Test ROC Curve (AUC = {:.2f})'.format(test_roc_auc))
plt.plot([0, 1], [0, 1], 'r--', label='Random', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic for Test For Random forest')
plt.legend(loc='lower right')
plt.show()
#df = datasets["application_train"]
df = X_train
#input_f = datasets["application_train"]
X = df.drop('TARGET', axis=1) # Features
y = df['TARGET'] # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42) # Split data into train and test sets
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,
test_size=0.2, random_state=42)
print(f"X train shape: {X_train.shape}")
print(f"X validation shape: {X_valid.shape}")
print(f"X test shape: {X_test.shape}")
#X_train_resampled, y_train_resampled = resample(X_train[y_train == 0], y_train[y_train == 0], n_samples=len(X_train[y_train == 1]), random_state=42)
#X_train = pd.concat([X_train[y_train == 1], X_train_resampled])
#y_train = pd.concat([y_train[y_train == 1], y_train_resampled])
# Numerical features: the int and float columns of the table (excluding the TARGET label)
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
numerical_features.remove("TARGET")
numerical_features
len(numerical_features)
# Categorical features: the object-typed columns of the table
categorical_features = df.select_dtypes(include=['object']).columns.tolist()
categorical_features
len(categorical_features)
from sklearn.base import BaseEstimator, TransformerMixin
# Create a class to select numerical or categorical columns from a DataFrame,
# returning them as a NumPy array for the downstream pipeline steps
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values
# Create numerical pipeline
numerical_pipeline = Pipeline([
('selector', DataFrameSelector(numerical_features)),
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# Create categorical pipeline
categorical_pipeline = Pipeline([
('selector', DataFrameSelector(categorical_features)),
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(handle_unknown='ignore', sparse=False))  # 'sparse' was renamed 'sparse_output' in scikit-learn 1.2
])
# Create a column transformer (an alternative to the FeatureUnion below).
# Note that ColumnTransformer expects (name, transformer, columns) triples.
column_transformer_pipeline = ColumnTransformer(
    transformers=[
        ('num_pipeline', numerical_pipeline, numerical_features),
        ('cat_pipeline', categorical_pipeline, categorical_features)
    ], remainder="passthrough")
from sklearn.pipeline import FeatureUnion
data_prep_pipeline_FU = FeatureUnion(transformer_list=[
("num_pipeline", numerical_pipeline),
("cat_pipeline", categorical_pipeline),
])
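For reference, `ColumnTransformer` can achieve the same preparation without a `DataFrameSelector` helper, since it takes the column lists directly. A minimal sketch on a toy frame (the toy column names are assumptions, not HCDR columns):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "amt": [1.0, np.nan, 3.0],     # numeric column with a missing value
    "kind": ["a", "b", np.nan],    # categorical column with a missing value
})

prep = ColumnTransformer(transformers=[
    # (name, transformer, columns) triples; no selector step needed
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["amt"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]), ["kind"]),
])

out = prep.fit_transform(toy)
print(out.shape)  # (3, 3): 1 scaled numeric column + 2 one-hot columns
```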
selected_features = numerical_features + categorical_features
tot_features = f"{len(selected_features)}: Num:{len(numerical_features)}, Cat:{len(categorical_features)}"
#Total Feature selected for processing
tot_features
# Set feature selection settings
# Features removed each step
feature_selection_steps=10
# Number of features used
features_used=len(selected_features)
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
def plot_roc_curve(model, X_train, y_train, X_test, y_test, X_valid, y_valid, name):
    """Plot ROC curves for the train, test, and validation sets."""
    # Compute ROC curve and ROC area for the train set
    fpr_train, tpr_train, _ = roc_curve(y_train, model.predict_proba(X_train)[:, 1])
    roc_auc_train = auc(fpr_train, tpr_train)
    # Compute ROC curve and ROC area for the test set
    fpr_test, tpr_test, _ = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
    roc_auc_test = auc(fpr_test, tpr_test)
    # Compute ROC curve and ROC area for the validation set
    fpr_valid, tpr_valid, _ = roc_curve(y_valid, model.predict_proba(X_valid)[:, 1])
    roc_auc_valid = auc(fpr_valid, tpr_valid)
    # Plot ROC curves
    plt.figure(figsize=(10, 7))
    plt.plot(fpr_train, tpr_train, color='darkorange', lw=2, label='Train ROC curve (area = %0.2f)' % roc_auc_train)
    plt.plot(fpr_test, tpr_test, color='blue', lw=2, label='Test ROC curve (area = %0.2f)' % roc_auc_test)
    plt.plot(fpr_valid, tpr_valid, color='green', lw=2, label='Validation ROC curve (area = %0.2f)' % roc_auc_valid)
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC curve - ' + name)
    plt.legend(loc="lower right")
    plt.show()
from sklearn.metrics import precision_recall_curve
def plot_precision_recall_all(model, X_train, y_train, X_test, y_test, X_valid, y_valid, name):
    """Plot precision-recall curves for the train, test, and validation sets."""
    # Get precision and recall values for the train, test, and validation sets
    train_precision, train_recall, train_thresholds = precision_recall_curve(y_train, model.predict_proba(X_train)[:, 1])
    test_precision, test_recall, test_thresholds = precision_recall_curve(y_test, model.predict_proba(X_test)[:, 1])
    valid_precision, valid_recall, valid_thresholds = precision_recall_curve(y_valid, model.predict_proba(X_valid)[:, 1])
    # Plot the precision-recall curves
    plt.plot(train_recall, train_precision, label='Train')
    plt.plot(test_recall, test_precision, label='Test')
    plt.plot(valid_recall, valid_precision, label='Valid')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall Curve - ' + name)
    plt.legend()
    plt.show()
import seaborn as sns
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(model, X_train, y_train, X_test, y_test, X_valid, y_valid, name):
    """Plot normalized confusion matrices for the train, test, and validation sets."""
    # Create confusion matrices for the train, test, and validation sets
    train_cm = confusion_matrix(y_train, model.predict(X_train))
    test_cm = confusion_matrix(y_test, model.predict(X_test))
    valid_cm = confusion_matrix(y_valid, model.predict(X_valid))
    # Plot the confusion matrices side by side
    fig, ax = plt.subplots(1, 3, figsize=(18, 5))
    sns.heatmap(train_cm/np.sum(train_cm), annot=True, fmt=".3%", cmap="Blues", ax=ax[0])
    ax[0].set_title('Train Confusion Matrix - ' + name)
    ax[0].set_xlabel('Predicted')
    ax[0].set_ylabel('Actual')
    sns.heatmap(test_cm/np.sum(test_cm), annot=True, fmt=".3%", cmap="Blues", ax=ax[1])
    ax[1].set_title('Test Confusion Matrix - ' + name)
    ax[1].set_xlabel('Predicted')
    ax[1].set_ylabel('Actual')
    sns.heatmap(valid_cm/np.sum(valid_cm), annot=True, fmt=".3%", cmap='Blues', ax=ax[2])
    ax[2].set_title('Validation Confusion Matrix - ' + name)
    ax[2].set_xlabel('Predicted')
    ax[2].set_ylabel('Actual')
    plt.show()
def pct(x):
    return round(100 * x, 3)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import RFE
classifiers = [
    ('Logistic Regression', LogisticRegression(solver='saga', random_state=42), "RFE"),
    ('Support Vector Machine', SVC(random_state=42, probability=True), "SVM"),
    ('Decision Tree', DecisionTreeClassifier(random_state=42), "RFE"),
    ('Random Forest', RandomForestClassifier(random_state=42), "RFE"),
    ('Gradient Boosting', GradientBoostingClassifier(warm_start=True, random_state=42), "RFE")
]
params_grid = {
'Logistic Regression': {
'penalty': ['l1', 'l2'],
'tol': [0.0001, 0.00001],
'C': [10, 1, 0.1, 0.01],
},
'Support Vector Machine': {
'kernel': ['linear'],
'degree': [4, 5],
'C': [0.001, 0.01], #Low C - allow for misclassification
'gamma': [0.01, 0.1, 1] #Low gamma - high variance and low bias
},
'Decision Tree': {
'criterion': ['gini', 'entropy'],
'max_depth': [5, 10, 15],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 3, 5]
},
'Random Forest': {
'n_estimators': [1000],
'max_features': ['auto', 'sqrt', 'log2'],
'max_depth': [5, 10, 15],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 3, 5],
'bootstrap': [True, False]
},
'Gradient Boosting': {
'n_estimators': [1000],
'max_depth': [5, 10, 15],
'max_features': ['auto', 'sqrt', 'log2'],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 3, 5],
'learning_rate': [0.01, 0.1, 1]
}
}
params_grid_lr = {
'lr__penalty': ['l1', 'l2'],
'lr__tol': [0.0001, 0.00001],
'lr__C': [10, 1, 0.1, 0.01],
}
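Note the `lr__` prefixes: when the estimator passed to GridSearchCV is a Pipeline, hyperparameters are addressed as `<step_name>__<param>` (the bare keys in the earlier `params_grid` dict would only work on an unwrapped estimator). The valid names can be listed from the pipeline itself; a small sketch with a stand-alone two-step pipeline:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("lr", LogisticRegression())])

# Every tunable parameter is exposed under <step_name>__<param>
param_names = pipe.get_params().keys()
print("lr__C" in param_names)        # True
print("lr__penalty" in param_names)  # True
```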
full_pipeline_with_predictor = Pipeline([
("preparation", data_prep_pipeline_FU),
#('RFE', RFE(estimator=LogisticRegression(solver='saga', random_state=42), n_features_to_select=features_used, step=feature_selection_steps)),
("lr", LogisticRegression(solver='saga', random_state=42))
])
grid_search = GridSearchCV(full_pipeline_with_predictor, params_grid_lr, cv=2, scoring='roc_auc',
n_jobs=-1,verbose=1)
grid_search.fit(X_train, y_train)
# Best estimator score
best_train = pct(grid_search.best_score_)
print(grid_search.best_params_)
from sklearn.metrics import log_loss
# get the best model from GridSearchCV
best_model = grid_search.best_estimator_
# calculate scores on training set
best_train_accuracy_lr = np.round(best_model.score(X_train, y_train), 4)
best_train_f1_lr = np.round(f1_score(y_train, best_model.predict(X_train)), 4)
best_train_logloss_lr = np.round(log_loss(y_train, best_model.predict_proba(X_train)), 4)
best_train_roc_auc_lr = np.round(roc_auc_score(y_train, best_model.predict_proba(X_train)[:,1]), 4)
# calculate scores on validation set
best_valid_accuracy_lr = np.round(best_model.score(X_valid, y_valid), 4)
best_valid_f1_lr = np.round(f1_score(y_valid, best_model.predict(X_valid)), 4)
best_valid_logloss_lr = np.round(log_loss(y_valid, best_model.predict_proba(X_valid)), 4)
best_valid_roc_auc_lr = np.round(roc_auc_score(y_valid, best_model.predict_proba(X_valid)[:,1]), 4)
# calculate scores on test set
best_test_accuracy_lr = np.round(best_model.score(X_test, y_test), 4)
best_test_f1_lr = np.round(f1_score(y_test, best_model.predict(X_test)), 4)
best_test_logloss_lr = np.round(log_loss(y_test, best_model.predict_proba(X_test)), 4)
best_test_roc_auc_lr = np.round(roc_auc_score(y_test, best_model.predict_proba(X_test)[:,1]), 4)
print(best_train_accuracy_lr,best_valid_accuracy_lr,best_test_accuracy_lr)
print("Fit and Prediction with best estimator")
start = time()
model = grid_search.best_estimator_.fit(X_train, y_train)
train_time_lr = round(time() - start, 4)
print(train_time_lr)
plot_confusion_matrix(best_model,X_train,y_train,X_test,y_test,X_valid, y_valid,"Logistic Regression")
plot_roc_curve(best_model,X_train,y_train,X_test, y_test,X_valid, y_valid,"Logistic Regression")
plot_precision_recall_all(best_model,X_train,y_train,X_test, y_test,X_valid, y_valid,"Logistic Regression")
params_grid_dt = {
'dt__criterion': ['gini', 'entropy'],
'dt__max_depth': [5,10],
'dt__min_samples_split': [2,5, 10],
'dt__max_features': ['auto','sqrt','log2'],
'dt__splitter':['best','random']
}
full_pipeline_with_predictor_dt = Pipeline([
("preparation", data_prep_pipeline_FU),
#('RFE', RFE(estimator=LogisticRegression(solver='saga', random_state=42), n_features_to_select=features_used, step=feature_selection_steps)),
("dt", DecisionTreeClassifier(random_state=42))
])
grid_search_dt = GridSearchCV(full_pipeline_with_predictor_dt, params_grid_dt, cv=2, scoring='roc_auc',
n_jobs=-1,verbose=1)
grid_search_dt.fit(X_train, y_train)
# Best estimator score
best_train_dt = pct(grid_search_dt.best_score_)
print(grid_search_dt.best_params_)
from sklearn.metrics import log_loss
# get the best model from GridSearchCV
best_model_dt = grid_search_dt.best_estimator_
# calculate scores on training set
best_train_accuracy_dt = np.round(best_model_dt.score(X_train, y_train), 4)
best_train_f1_dt = np.round(f1_score(y_train, best_model_dt.predict(X_train)), 4)
best_train_logloss_dt = np.round(log_loss(y_train, best_model_dt.predict_proba(X_train)), 4)
best_train_roc_auc_dt = np.round(roc_auc_score(y_train, best_model_dt.predict_proba(X_train)[:,1]), 4)
# calculate scores on validation set
best_valid_accuracy_dt = np.round(best_model_dt.score(X_valid, y_valid), 4)
best_valid_f1_dt = np.round(f1_score(y_valid, best_model_dt.predict(X_valid)), 4)
best_valid_logloss_dt = np.round(log_loss(y_valid, best_model_dt.predict_proba(X_valid)), 4)
best_valid_roc_auc_dt = np.round(roc_auc_score(y_valid, best_model_dt.predict_proba(X_valid)[:,1]), 4)
# calculate scores on test set
best_test_accuracy_dt = np.round(best_model_dt.score(X_test, y_test), 4)
best_test_f1_dt = np.round(f1_score(y_test, best_model_dt.predict(X_test)), 4)
best_test_logloss_dt = np.round(log_loss(y_test, best_model_dt.predict_proba(X_test)), 4)
best_test_roc_auc_dt = np.round(roc_auc_score(y_test, best_model_dt.predict_proba(X_test)[:,1]), 4)
print("Fit and Prediction with best estimator")
start = time()
model = grid_search_dt.best_estimator_.fit(X_train, y_train)
train_time_dt = round(time() - start, 4)
print(train_time_dt)
plot_confusion_matrix(best_model_dt,X_train,y_train,X_test,y_test,X_valid, y_valid,"Decision Tree")
plot_roc_curve(best_model_dt,X_train,y_train,X_test, y_test,X_valid, y_valid,"Decision Tree")
plot_precision_recall_all(best_model_dt,X_train,y_train,X_test, y_test,X_valid, y_valid,"Decision Tree")
print(best_train_accuracy_dt,best_valid_accuracy_dt,best_test_accuracy_dt)
X_train.head()
params_grid_rf = {
'rf__n_estimators': [50,200],
'rf__max_depth': [2,5],
'rf__min_samples_split': [2, 5]
}
full_pipeline_with_predictor_rf = Pipeline([
("preparation", data_prep_pipeline_FU),
#('RFE', RFE(estimator=LogisticRegression(solver='saga', random_state=42), n_features_to_select=features_used, step=feature_selection_steps)),
("rf", RandomForestClassifier(random_state=42))
])
grid_search_rf = GridSearchCV(full_pipeline_with_predictor_rf, params_grid_rf, cv=2, scoring='roc_auc',
n_jobs=-1,verbose=1)
grid_search_rf.fit(X_train, y_train)
# Best estimator score
best_train_rf = pct(grid_search_rf.best_score_)
print(grid_search_rf.best_params_)
from sklearn.metrics import log_loss
# get the best model from GridSearchCV
best_model_rf = grid_search_rf.best_estimator_
# calculate scores on training set
best_train_accuracy_rf = np.round(grid_search_rf.score(X_train, y_train), 4)
best_train_f1_rf = np.round(f1_score(y_train, grid_search_rf.predict(X_train)), 4)
best_train_logloss_rf = np.round(log_loss(y_train, grid_search_rf.predict_proba(X_train)), 4)
best_train_roc_auc_rf = np.round(roc_auc_score(y_train, grid_search_rf.predict_proba(X_train)[:,1]), 4)
# calculate scores on validation set
best_valid_accuracy_rf = np.round(grid_search_rf.score(X_valid, y_valid), 4)
best_valid_f1_rf = np.round(f1_score(y_valid, grid_search_rf.predict(X_valid)), 4)
best_valid_logloss_rf = np.round(log_loss(y_valid, grid_search_rf.predict_proba(X_valid)), 4)
best_valid_roc_auc_rf = np.round(roc_auc_score(y_valid, grid_search_rf.predict_proba(X_valid)[:,1]), 4)
# calculate scores on test set
best_test_accuracy_rf = np.round(grid_search_rf.score(X_test, y_test), 4)
best_test_f1_rf = np.round(f1_score(y_test, grid_search_rf.predict(X_test)), 4)
best_test_logloss_rf = np.round(log_loss(y_test, grid_search_rf.predict_proba(X_test)), 4)
best_test_roc_auc_rf = np.round(roc_auc_score(y_test, grid_search_rf.predict_proba(X_test)[:,1]), 4)
print("Fit and Prediction with best estimator")
start = time()
model = grid_search_rf.best_estimator_.fit(X_train, y_train)
train_time_rf = round(time() - start, 4)
print(train_time_rf)
print(best_train_accuracy_rf,best_valid_accuracy_rf,best_test_accuracy_rf)
print(best_valid_f1_rf)
plot_confusion_matrix(best_model_rf,X_train,y_train,X_test,y_test,X_valid, y_valid,"Random Forest")
plot_roc_curve(best_model_rf,X_train,y_train,X_test, y_test,X_valid, y_valid,"Random Forest")
plot_precision_recall_all(best_model_rf,X_train,y_train,X_test, y_test,X_valid, y_valid,"Random Forest")
Experiment Log
exp_name = f"Baseline_{len(selected_features)}_features"
experimentLog = pd.DataFrame(columns=["exp_name",
"Train Acc",
"Valid Acc",
"Test Acc",
"Train AUC",
"Valid AUC",
"Test AUC",
"Train F1 Score",
"Valid F1 Score",
"Test F1 Score",
"Train Log Loss",
"Valid Log Loss",
"Test Log Loss",
"Train Time",
"Description"
])
#Logistic Regression
exp_name = "Logistic Regression"
# each row is assigned as one flat list; wrapping exp_name in its own list
# (as before) corrupts the row alignment
experimentLog.loc[len(experimentLog)] = [exp_name, best_train_accuracy_lr, best_valid_accuracy_lr, best_test_accuracy_lr, best_train_roc_auc_lr, best_valid_roc_auc_lr, best_test_roc_auc_lr, best_train_f1_lr, best_valid_f1_lr, best_test_f1_lr, best_train_logloss_lr, best_valid_logloss_lr, best_test_logloss_lr, train_time_lr, str(grid_search.best_params_)]
#Decision Tree
exp_name = "Decision Tree"
experimentLog.loc[len(experimentLog)] = [exp_name, best_train_accuracy_dt, best_valid_accuracy_dt, best_test_accuracy_dt, best_train_roc_auc_dt, best_valid_roc_auc_dt, best_test_roc_auc_dt, best_train_f1_dt, best_valid_f1_dt, best_test_f1_dt, best_train_logloss_dt, best_valid_logloss_dt, best_test_logloss_dt, train_time_dt, str(grid_search_dt.best_params_)]
#Random Forest
exp_name = "Random Forest"
experimentLog.loc[len(experimentLog)] = [exp_name, best_train_accuracy_rf, best_valid_accuracy_rf, best_test_accuracy_rf, best_train_roc_auc_rf, best_valid_roc_auc_rf, best_test_roc_auc_rf, best_train_f1_rf, best_valid_f1_rf, best_test_f1_rf, best_train_logloss_rf, best_valid_logloss_rf, best_test_logloss_rf, train_time_rf, str(grid_search_rf.best_params_)]
experimentLog
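Positional row assignment is fragile: adding or reordering a column silently misaligns every metric. Assigning a dict keyed by column name avoids this; a minimal sketch with a shortened column list and illustrative numbers (not results from this project):

```python
import pandas as pd

log = pd.DataFrame(columns=["exp_name", "Valid AUC", "Train Time"])

# a dict row is matched to columns by name, so key order no longer matters
log.loc[len(log)] = {"exp_name": "Random Forest", "Valid AUC": 0.74, "Train Time": 12.3}
log.loc[len(log)] = {"Train Time": 3.1, "exp_name": "Logistic Regression", "Valid AUC": 0.75}
```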
params_grid_gb = {
'gb__n_estimators': [1000],
'gb__max_depth': [5, 10],
'gb__max_features': ['auto', 'sqrt', 'log2'],
'gb__min_samples_split': [5, 10],
'gb__learning_rate': [0.01, 0.1, 1]
}
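Before launching the search it helps to know how many fits the grid implies; with the parameter lists above and `cv=2`, the count works out as:

```python
from itertools import product

params_grid_gb = {
    'gb__n_estimators': [1000],
    'gb__max_depth': [5, 10],
    'gb__max_features': ['auto', 'sqrt', 'log2'],
    'gb__min_samples_split': [5, 10],
    'gb__learning_rate': [0.01, 0.1, 1],
}
n_candidates = len(list(product(*params_grid_gb.values())))  # 1*2*3*2*3 = 36
n_fits = n_candidates * 2                                    # times cv=2 folds
print(n_candidates, n_fits)  # 36 72
```

With 1000 trees per candidate, those 72 fits are what makes this cell the slowest in the notebook.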
full_pipeline_with_predictor_gb = Pipeline([
("preparation", data_prep_pipeline_FU),
#('RFE', RFE(estimator=LogisticRegression(solver='saga', random_state=42), n_features_to_select=features_used, step=feature_selection_steps)),
("gb", GradientBoostingClassifier(random_state=42))
])
grid_search_gb = GridSearchCV(full_pipeline_with_predictor_gb, params_grid_gb, cv=2, scoring='roc_auc',
n_jobs=-1,verbose=1)
grid_search_gb.fit(X_train, y_train)
# Best estimator score
best_train_gb = pct(grid_search_gb.best_score_)
print(grid_search_gb.best_params_)
from sklearn.metrics import log_loss
# get the best model from GridSearchCV
best_model_gb = grid_search_gb.best_estimator_
# calculate scores on training set
best_train_accuracy_gb = np.round(grid_search_gb.score(X_train, y_train), 4)
best_train_f1_gb = np.round(f1_score(y_train, grid_search_gb.predict(X_train)), 4)
best_train_logloss_gb = np.round(log_loss(y_train, grid_search_gb.predict_proba(X_train)), 4)
best_train_roc_auc_gb = np.round(roc_auc_score(y_train, grid_search_gb.predict_proba(X_train)[:,1]), 4)
# calculate scores on validation set
best_valid_accuracy_gb = np.round(grid_search_gb.score(X_valid, y_valid), 4)
best_valid_f1_gb = np.round(f1_score(y_valid, grid_search_gb.predict(X_valid)), 4)
best_valid_logloss_gb = np.round(log_loss(y_valid, grid_search_gb.predict_proba(X_valid)), 4)
best_valid_roc_auc_gb = np.round(roc_auc_score(y_valid, grid_search_gb.predict_proba(X_valid)[:,1]), 4)
# calculate scores on test set
best_test_accuracy_gb = np.round(grid_search_gb.score(X_test, y_test), 4)
best_test_f1_gb = np.round(f1_score(y_test, grid_search_gb.predict(X_test)), 4)
best_test_logloss_gb = np.round(log_loss(y_test, grid_search_gb.predict_proba(X_test)), 4)
best_test_roc_auc_gb = np.round(roc_auc_score(y_test, grid_search_gb.predict_proba(X_test)[:,1]), 4)
print("Fit and Prediction with best estimator")
start = time()
model = grid_search_gb.best_estimator_.fit(X_train, y_train)
train_time_gb = round(time() - start, 4)
print(train_time_gb)
print(best_train_accuracy_gb,best_valid_accuracy_gb,best_test_accuracy_gb)
print(best_valid_f1_gb)
plot_confusion_matrix(best_model_gb,X_train,y_train,X_test,y_test,X_valid, y_valid,"Gradient Boosting")
plot_roc_curve(best_model_gb,X_train,y_train,X_test, y_test,X_valid, y_valid,"Gradient Boosting")
plot_precision_recall_all(best_model_gb,X_train,y_train,X_test, y_test,X_valid, y_valid,"Gradient Boosting")
!pip install -q pytorch-lightning
import pandas as pd
import torch
import pytorch_lightning as pl
from torch import nn
from torchmetrics import Accuracy
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from numpy import vstack
from sklearn.metrics import accuracy_score
from sklearn.metrics import make_scorer, roc_auc_score
# Load the data
df = pd.read_csv('X_train_final')
# Split the dataset into features and target
X = df.drop('TARGET', axis=1)
y = df['TARGET']
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define the numerical features and categorical features
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
numerical_features.remove("TARGET")
categorical_features = df.select_dtypes(include=['object']).columns.tolist()
# Define the numerical pipeline
num_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# Define the categorical pipeline
cat_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine the numerical and categorical pipelines using ColumnTransformer
preprocessor = ColumnTransformer([
('num', num_pipeline, numerical_features),
('cat', cat_pipeline, categorical_features)
])
# Preprocess the training data
X_train = preprocessor.fit_transform(X_train)
# Preprocess the testing data
X_test = preprocessor.transform(X_test)
# Convert the data to PyTorch tensors (densify first if needed: a ColumnTransformer
# with a OneHotEncoder may return a scipy sparse matrix, which torch.Tensor rejects)
X_train = torch.Tensor(X_train.toarray() if hasattr(X_train, "toarray") else X_train)
X_test = torch.Tensor(X_test.toarray() if hasattr(X_test, "toarray") else X_test)
y_train = torch.Tensor(y_train.values).unsqueeze(1)
y_test = torch.Tensor(y_test.values).unsqueeze(1)
class MLP(pl.LightningModule):
def __init__(self, input_dim, hidden_dim, output_dim):
super().__init__()
self.train_acc = Accuracy("binary", num_classes=2)
self.valid_acc = Accuracy("binary", num_classes=2)
self.test_acc = Accuracy("binary", num_classes=2)
self.input_layer = nn.Linear(input_dim, hidden_dim[0])
self.hidden_layers = nn.ModuleList()
for i in range(len(hidden_dim)-1):
self.hidden_layers.append(nn.Linear(hidden_dim[i], hidden_dim[i+1]))
self.output_layer = nn.Linear(hidden_dim[-1], output_dim)
nn.init.kaiming_uniform_(self.input_layer.weight)
for hidden_layer in self.hidden_layers:
nn.init.kaiming_uniform_(hidden_layer.weight)
nn.init.xavier_uniform_(self.output_layer.weight)
self.relu = nn.ReLU()
self.sigmoid = nn.Sigmoid()
self.loss = nn.BCELoss()
def forward(self, x):
x = self.input_layer(x)
x = self.relu(x)
for hidden_layer in self.hidden_layers:
x = hidden_layer(x)
x = self.relu(x)
x = self.output_layer(x)
x = self.sigmoid(x)
return x
def configure_optimizers(self):
optimizer = torch.optim.Adam(self.parameters(), lr=0.01)
return optimizer
def training_step(self, batch, batch_idx):
x, y = batch
y_pred = self(x)
loss = self.loss(y_pred, y)
self.log('train_loss', loss)
y_pred = torch.round(y_pred)
correct = (y_pred == y).sum().item()
total = y.shape[0]
self.log('train_acc', correct / total, prog_bar=True)
return loss
def validation_step(self, batch, batch_idx):
x, y = batch
y_pred = self(x)
loss = self.loss(y_pred, y)
self.log('val_loss', loss)
return loss
def test_step(self, batch, batch_idx):
x, y = batch
# forward() already applies a sigmoid, so the output is P(TARGET=1)
probs = self(x)
loss = self.loss(probs, y)
# threshold the probability at 0.5; argmax over a single output unit
# would always return 0 and make test_acc meaningless
preds = torch.round(probs)
self.log("test_loss", loss, prog_bar=True)
acc = (preds == y).float().mean()
self.log('test_acc', acc, on_step=True, on_epoch=True, prog_bar=True)
return loss
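A binary classifier with a single output unit is evaluated by thresholding: the decision "probability >= 0.5" is exactly the decision "raw logit >= 0", since the sigmoid is monotonic with σ(0) = 0.5. A quick pure-Python check:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for logit in [-3.2, -0.5, 0.0, 0.1, 2.7]:
    prob_decision = sigmoid(logit) >= 0.5   # decide on the probability
    logit_decision = logit >= 0.0           # decide on the raw logit
    assert prob_decision == logit_decision
```

This is why a model can equivalently output logits (for `BCEWithLogitsLoss`) or probabilities (for `BCELoss`) as long as the thresholding matches the output.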
class MyDataModule(pl.LightningDataModule):
def __init__(self, dataloader):
super().__init__()
# NOTE: the same loader is reused for train/val/test here for convenience;
# in practice each split should get its own DataLoader to avoid leakage
self.dataloader = dataloader
def train_dataloader(self):
return self.dataloader
def val_dataloader(self):
return self.dataloader
def test_dataloader(self):
return self.dataloader
from pytorch_lightning.callbacks import ModelCheckpoint
# save top 1 model; monitor 'train_acc' because no validation loader is passed
# to trainer.fit below, so no 'valid_acc' metric is ever logged
callbacks = [ModelCheckpoint(save_top_k=1, mode='max', monitor="train_acc")]
input_dim = X_train.shape[1]
hidden_dim = [128, 64, 32, 16]
output_dim = 1
num_epochs = 400
batch_size = 100
# Create the MLP model
mlp = MLP(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=1)
# Create the PyTorch Lightning trainer
if torch.cuda.is_available(): # if you have GPUs
print("GPU is available")
trainer = pl.Trainer(max_epochs=num_epochs, callbacks=callbacks, accelerator='auto')
else:
trainer = pl.Trainer(max_epochs=10, callbacks=callbacks)
#trainer = pl.Trainer(max_epochs=num_epochs, accelerator='auto')
val_dataset = torch.utils.data.TensorDataset(X_test, y_test)
# Define the validation data loader
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
# Define the training dataset
train_dataset = torch.utils.data.TensorDataset(X_train, y_train)
# Define the training data loader
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# Train the model
trainer.fit(mlp, train_loader)
# Get the final training accuracy
train_acc = trainer.callback_metrics['train_acc']
print(f'Train Accuracy: {train_acc:.4f}')
#test_results = trainer.test(mlp, datamodule=val_loader)
datamodule = MyDataModule(val_loader)
# Test the model
test_results = trainer.test(model=mlp, datamodule=datamodule)
print(test_results)
test_acc = test_results[0]['test_acc_epoch']
print(f'Test Accuracy: {test_acc:.4f}')
# ROC-AUC needs continuous scores: rounding to 0/1 first collapses the ranking
scores, actuals = list(), list()
for inputs, targets in val_loader:
yhat = mlp(inputs).detach().numpy()
scores.append(yhat)
actuals.append(targets.numpy().reshape((len(targets), 1)))
scores, actuals = vstack(scores), vstack(actuals)
acc = accuracy_score(actuals, scores.round())
auc = roc_auc_score(actuals, scores)
#print("Accuracy: ", acc)
print("AUC: ", auc)
# Start tensorboard
%load_ext tensorboard
%tensorboard --logdir lightning_logs/
Using 4 hidden layers with the BCEWithLogitsLoss loss function and the Adam optimizer, trained for 100 epochs with a batch size of 128
class MLP1(pl.LightningModule):
def __init__(self, input_dim, hidden_dim, output_dim):
super().__init__()
self.train_acc = Accuracy("binary", num_classes=2)
self.valid_acc = Accuracy("binary", num_classes=2)
self.test_acc = Accuracy("binary", num_classes=2)
self.input_layer = nn.Linear(input_dim, hidden_dim[0])
self.hidden_layers = nn.ModuleList()
for i in range(len(hidden_dim)-1):
self.hidden_layers.append(nn.Linear(hidden_dim[i], hidden_dim[i+1]))
self.output_layer = nn.Linear(hidden_dim[-1], output_dim)
nn.init.kaiming_uniform_(self.input_layer.weight)
for hidden_layer in self.hidden_layers:
nn.init.kaiming_uniform_(hidden_layer.weight)
nn.init.xavier_uniform_(self.output_layer.weight)
self.relu = nn.ReLU()
self.sigmoid = nn.Sigmoid()
# instantiate the loss function
self.loss = nn.BCEWithLogitsLoss()
def forward(self, x):
x = self.input_layer(x)
x = self.relu(x)
for hidden_layer in self.hidden_layers:
x = hidden_layer(x)
x = self.relu(x)
# return raw logits: BCEWithLogitsLoss applies the sigmoid internally,
# so applying self.sigmoid here would squash the signal twice
return self.output_layer(x)
def configure_optimizers(self):
optimizer = torch.optim.Adam(self.parameters(), lr=0.0001)
return optimizer
def training_step(self, batch, batch_idx):
x, y = batch
logits = self(x)
loss = self.loss(logits, y)
self.log('train_loss', loss)
# convert logits to probabilities before thresholding for accuracy
preds = torch.round(torch.sigmoid(logits))
correct = (preds == y).sum().item()
total = y.shape[0]
self.log('train_acc', correct / total, prog_bar=True)
return loss
def validation_step(self, batch, batch_idx):
x, y = batch
loss = self.loss(self(x), y)
self.log('val_loss', loss)
return loss
def test_step(self, batch, batch_idx):
x, y = batch
logits = self(x)
loss = self.loss(logits, y)
# threshold the sigmoid probability at 0.5; argmax over one unit is meaningless
preds = torch.round(torch.sigmoid(logits))
self.log("test_loss", loss, prog_bar=True)
acc = (preds == y).float().mean()
self.log('test_acc', acc, on_step=True, on_epoch=True, prog_bar=True)
return loss
input_dim = X_train.shape[1]
hidden_dim = [128, 64, 32, 16]
output_dim = 1
num_epochs = 100
batch_size = 128
# Create the MLP1 model
mlp1 = MLP1(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=1)
# Create the PyTorch Lightning trainer
if torch.cuda.is_available(): # if you have GPUs
print("GPU is available")
trainer = pl.Trainer(max_epochs=num_epochs, callbacks=callbacks, accelerator='auto')
else:
trainer = pl.Trainer(max_epochs=10, callbacks=callbacks)
#trainer = pl.Trainer(max_epochs=num_epochs, accelerator='auto')
val_dataset1 = torch.utils.data.TensorDataset(X_test, y_test)
# Define the validation data loader
val_loader1 = torch.utils.data.DataLoader(val_dataset1, batch_size=batch_size, shuffle=False)
# Define the training dataset
train_dataset1 = torch.utils.data.TensorDataset(X_train, y_train)
# Define the training data loader
train_loader1 = torch.utils.data.DataLoader(train_dataset1, batch_size=batch_size, shuffle=True)
# Train the model
trainer.fit(mlp1, train_loader1)
# Get the final training accuracy
train_acc1 = trainer.callback_metrics['train_acc']
print(f'Train Accuracy: {train_acc1:.4f}')
from sklearn.metrics import roc_auc_score
# Set the model to evaluation mode
mlp1.eval()
# Create lists to store the predicted and true labels
y_pred = []
y_true = []
# Iterate over the validation data loader and make predictions
for x_batch, y_batch in val_loader1:
y_pred_batch = mlp1(x_batch)
y_pred_batch = torch.sigmoid(y_pred_batch).detach().cpu().numpy()
y_pred.extend(y_pred_batch)
y_true.extend(y_batch.detach().cpu().numpy())
# Calculate the ROC-AUC score
roc_auc = roc_auc_score(y_true, y_pred)
print(f'ROC-AUC Score: {roc_auc:.4f}')
#test_results = trainer.test(mlp, datamodule=val_loader)
datamodule = MyDataModule(val_loader)
# Test the model
test_results1 = trainer.test(model=mlp1, datamodule=datamodule)
print(test_results1)
test_acc1 = test_results1[0]['test_acc_epoch']
print(f'Test Accuracy: {test_acc1:.4f}')
# ROC-AUC is rank-based, so raw model scores can be used directly;
# rounding them to 0/1 first would collapse the ranking
scores, actuals = list(), list()
for inputs, targets in val_loader:
yhat = mlp1(inputs).detach().numpy()
scores.append(yhat)
actuals.append(targets.numpy().reshape((len(targets), 1)))
scores, actuals = vstack(scores), vstack(actuals)
auc1 = roc_auc_score(actuals, scores)
print("AUC: ", auc1)
# Start tensorboard
%load_ext tensorboard
%tensorboard --logdir lightning_logs/
input_dim = X_train.shape[1]
hidden_dim = [128, 64, 32]
output_dim = 1
num_epochs = 100
batch_size = 128
# Create the third model: 3 hidden layers, single sigmoid output
mlp2 = MLP(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim)
# Create the PyTorch Lightning trainer
if torch.cuda.is_available(): # if you have GPUs
print("GPU is available")
trainer = pl.Trainer(max_epochs=num_epochs, callbacks=callbacks, accelerator='auto')
else:
trainer = pl.Trainer(max_epochs=10, callbacks=callbacks)
#trainer = pl.Trainer(max_epochs=num_epochs, accelerator='auto')
val_dataset = torch.utils.data.TensorDataset(X_test, y_test)
# Define the validation data loader
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
# Define the training dataset
train_dataset = torch.utils.data.TensorDataset(X_train, y_train)
# Define the training data loader
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# Train the model
trainer.fit(mlp2, train_loader)
# Get the final training accuracy
train_acc2 = trainer.callback_metrics['train_acc']
print(f'Train Accuracy: {train_acc2:.4f}')
#test_results = trainer.test(mlp, datamodule=val_loader)
datamodule = MyDataModule(val_loader)
# Test the model
test_results2 = trainer.test(model=mlp2, datamodule=datamodule)
print(test_results2)
test_acc2 = test_results2[0]['test_acc_epoch']
print(f'Test Accuracy: {test_acc2:.4f}')
# ROC-AUC from continuous probabilities rather than rounded labels
scores, actuals = list(), list()
for inputs, targets in val_loader:
yhat = mlp2(inputs).detach().numpy()
scores.append(yhat)
actuals.append(targets.numpy().reshape((len(targets), 1)))
scores, actuals = vstack(scores), vstack(actuals)
auc2 = roc_auc_score(actuals, scores)
print("AUC: ", auc2)
# Start tensorboard
%load_ext tensorboard
%tensorboard --logdir lightning_logs/
exp_name = f"Baseline_{len(selected_features)}_features"
experimentLog1 = pd.DataFrame(columns=["DataSet",
"Learning Rate",
"Epochs",
"Hidden layers",
"Optimizer",
"Loss Function",
"Train Accuracy",
"Test Accuracy",
"AUC_ROC",
"Description"
])
# MLP experiments
exp_name = "HCDR"
experimentLog1.loc[len(experimentLog1)] = "HCDR",0.01, 400,4,"Adam","BCELoss",train_acc,test_acc,auc, "Multi layer perceptron exp 1"
print(experimentLog1.columns)
experimentLog1.loc[len(experimentLog1)] = "HCDR",0.001, 100,4,"Adam","BCEWithLogitsLoss",train_acc1,test_acc1,auc1, "Multi layer perceptron exp 2"
experimentLog1.loc[len(experimentLog1)] = "HCDR",0.01, 100,3,"Adam","BCELoss",train_acc2,test_acc2,auc2, "Multi layer perceptron exp 3"
experimentLog1
For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:
SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
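The required format can be produced with the standard library alone; a minimal sketch with the example IDs above and made-up probabilities, writing to a hypothetical `submission_example.csv`:

```python
import csv

# (SK_ID_CURR, predicted probability of default) -- illustrative values only
rows = [(100001, 0.1), (100005, 0.9), (100013, 0.2)]

with open("submission_example.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["SK_ID_CURR", "TARGET"])  # the header row is required
    writer.writerows(rows)
```

Note that TARGET must be a probability in [0, 1], not a hard 0/1 label, since the competition scores submissions by ROC-AUC.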
test_class_scores[0:10]
test_results[0:10]
# Collect predicted probabilities for the submission; the competition scores
# probabilities, so the outputs are deliberately NOT rounded to hard labels here
predictions = list()
for inputs, targets in val_loader:
yhat = mlp(inputs)
predictions.append(yhat.detach().numpy())
predictions = vstack(predictions)
print("predictions are: ", predictions)
# Submission dataframe (.copy() avoids pandas' SettingWithCopyWarning)
submit_df = datasets["application_test"][['SK_ID_CURR']].copy()
submit_df['TARGET'] = pd.DataFrame(predictions)
submit_df.head()
import pandas as pd
import numpy as np
# Save the submission DataFrame to a CSV file
submit_df.to_csv('submission.csv', index=False)
! kaggle competitions submit -c home-credit-default-risk -f submission.csv -m "baseline submission"
The project focuses on predicting whether a client can repay a loan, using the "Home Credit Default Risk" Kaggle competition dataset. The aim is to improve the loan experience for clients with insufficient credit histories by using additional information, such as telco and transactional data. Following extensive exploratory data analysis (EDA), we performed feature engineering, introduced three new features, and trained the final feature set on several models, including Logistic Regression, Decision Tree, and Random Forest, with hyperparameter tuning via GridSearchCV. For the final phase of the project, three neural networks were implemented using PyTorch Lightning for classification. The dataset was transformed into a tensor using the 'Data_pre_nn' function; ReLU was used as the activation function along with the Adam optimizer, and a binary cross-entropy loss was minimized via backpropagation. Neural Network 1, with 4 hidden layers, 400 epochs, and a batch size of 100, gave the best results. Neural Network 2 had the same architecture as Neural Network 1, but with 100 epochs and a batch size of 128. Neural Network 3 had a different architecture, with 3 hidden layers. The first neural network performed best, with an accuracy of 0.938 and an AUC score of 0.580.
The HomeCredit_columns_description.csv file acts as a data dictionary.
There are 7 different sources of data:
application_train/application_test: the main table, split into training and testing files, with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature SK_ID_CURR. The training data comes with the TARGET column: 0 means the loan was repaid, 1 means it was not. TARGET marks clients who had payment difficulties, i.e., a late payment of more than X days on at least one of the first Y installments of the loan; all other cases are marked 0. The testing data does not include the TARGET value.
bureau: the client's previous credits from other financial institutions that were reported to the Credit Bureau. Each prior credit has its own row, and one loan in the application data can have multiple previous credits recorded before the application date.
bureau_balance: monthly balances of the previous credits in bureau. Each row is one month of a previous credit, so a single previous credit can have multiple rows, one for each month of the credit length.
POS_CASH_BALANCE: monthly balance snapshots of previous point-of-sale (POS) and cash loans that clients had with Home Credit. Each row is one month of a previous POS or cash loan, and a single previous loan can have many rows.
credit_card_balance: monthly balance snapshots of previous credit cards that clients had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
previous_application: previous loan applications at Home Credit by clients who have loans in the application data. Each current loan can have multiple previous applications; each previous application has one row and is identified by the feature SK_ID_PREV.
installments_payments: repayment history for previously disbursed Home Credit loans. There is one row for every payment made by the client and one row for every missed payment; each row corresponds to one payment of one installment.
The major challenge we faced was choosing the right architecture. When building an MLP, the architecture can strongly affect model performance, so choosing appropriate parameters such as the learning rate, number of epochs, optimizer, and activation function plays a major role. To tackle this, we ran multiple experiments, varying these parameters until we obtained the best AUC score. We also faced computational constraints, as training a neural network can be time-consuming, especially with a larger number of epochs.
The data was downloaded from the Kaggle website; it was made open by the Home Credit Group in an attempt to make better predictions when classifying a user as a potential defaulter or non-defaulter based on non-conventional user data.
Source Data
Loading Data into Dictionary
Perform EDA on data
Feature Engineering
Preprocessing Pipelines
Transformation and Merging
Numerical and Categorical Pipelines
Feature Union Pipeline
Modelling Pipelines
Data lineage diagram (figure)
We implemented three neural networks using PyTorch Lightning for classification. The dataset was transformed into a tensor using the 'Data_pre_nn' function; the network has 160 input features and one output unit.
The DataLoader then serves the train, test, and validation splits with those 160 features. We used the ReLU (Rectified Linear Unit) activation function and the Adam optimizer with a learning rate of 0.01.
For the loss function, we used the binary cross-entropy loss, minimized via backpropagation.
$$BCE(t,p) = -(t\log(p) + (1-t)\log(1-p))$$
Neural Network 1 is implemented with 4 hidden layers, 400 epochs, and a batch size of 100; its architecture in string form:
\begin{equation}
160 - 128 - \text{ReLU} - 64 - \text{ReLU} - 32 - \text{ReLU} - 16 - \text{ReLU} - 1 \ \text{sigmoid}
\end{equation}
Neural Network 2 is implemented with the same 4 hidden layers but 100 epochs and a batch size of 128; its architecture:
\begin{equation}
160 - 128 - \text{ReLU} - 64 - \text{ReLU} - 32 - \text{ReLU} - 16 - \text{ReLU} - 1 \ \text{sigmoid}
\end{equation}
Neural Network 3 is implemented with 3 hidden layers, 100 epochs, and a batch size of 128; its architecture:
\begin{equation}
160 - 128 - \text{ReLU} - 64 - \text{ReLU} - 32 - \text{ReLU} - 1 \ \text{sigmoid}
\end{equation}
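The size of these networks can be read off the architecture strings: each fully connected layer contributes fan_in × fan_out weights plus fan_out biases. A small helper (the function name is ours):

```python
def mlp_param_count(dims):
    # each Linear(in, out) layer has in*out weights plus out biases
    return sum((i + 1) * o for i, o in zip(dims, dims[1:]))

print(mlp_param_count([160, 128, 64, 32, 16, 1]))  # 31489 for networks 1 and 2
print(mlp_param_count([160, 128, 64, 32, 1]))      # 30977 for network 3
```

Both networks are small by deep-learning standards, so the long training times come from the number of epochs rather than the model size.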
Overall, model performance depends on architectural choices such as the learning rate, number of epochs, optimizer, and loss function. In our case, Neural Network 1, with a higher learning rate, more epochs, and 4 hidden layers, gave the best results.
Data leakage occurs when the training dataset contains information about what we are trying to predict, or when information from outside the training set leaks into training.
Steps we took to avoid data leakage:
We split off a validation set and held it out before working on the model.
We fit the standardization on the training set only and then applied it to the test set, so the model never sees the distribution of the entire dataset.
We made sure we evaluated correctly by using appropriate metrics and models and by staying alert to leakage, thereby avoiding the cardinal sins of ML.
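The standardization point can be made concrete: the scaler's statistics come from the training split only, and the held-out split is transformed with those same statistics. A dependency-free sketch with toy numbers (the helper names are ours):

```python
def fit_standardizer(train):
    # compute mean and (population) standard deviation from the training split only
    mean = sum(train) / len(train)
    var = sum((x - mean) ** 2 for x in train) / len(train)
    return mean, var ** 0.5

def transform(values, mean, std):
    return [(x - mean) / std for x in values]

train = [1.0, 2.0, 3.0, 4.0]   # statistics are estimated here
test = [10.0, 20.0]            # held-out data never influences mean/std
mean, std = fit_standardizer(train)
train_z = transform(train, mean, std)
test_z = transform(test, mean, std)
```

This mirrors sklearn's `scaler.fit_transform(X_train)` followed by `scaler.transform(X_test)`; calling `fit_transform` on the test set instead would leak its distribution.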
There were five different pipelines used in this process:
The first, the feature aggregator pipeline, generated new features from the secondary tables and merged them with the primary data (application_train and application_test). After feature aggregation, two functions, Drop Missing Features and Drop Collinear Features, drop the less informative features.
The second, the numerical pipeline, selected and standardized the numerical features, using a SimpleImputer with the 'mean' strategy.
The third, the categorical pipeline, selected the categorical features, one-hot encoded them, and handled missing values with a SimpleImputer using the 'most_frequent' strategy.
The fourth, the data pipeline, combined the outputs of the numerical and categorical pipelines via a feature union.
The last pipeline trained the model on the output of the data pipeline; in this phase it feeds the neural networks, whose parameters we varied for each experiment.
We split the features into numerical and categorical. In total there are 160 features, of which 141 are numerical and 19 are categorical, used in all of the neural network experiments.
In Experiment 1 we used a higher learning rate of 0.01, trained for 400 epochs with 4 hidden layers, optimized with Adam, and trained with the binary cross-entropy (BCE) loss. Experiment 2 used a lower learning rate of 0.001, trained for 100 epochs with 4 hidden layers, optimized with Adam, and used the binary cross-entropy with logits (BCEWithLogitsLoss) loss. Experiment 3 used a learning rate of 0.01, trained for 100 epochs with 3 hidden layers, optimized with Adam, and used the BCE loss. The goal was to implement the MLP model and find the parameters and architecture that best predict the target variable in the Home Credit Default Risk dataset.
For the loss function, we have used the Binary cross entropy loss function, and are using Backward Propagation to minimize the loss.
$$BCE(t,p) = -(t\log(p) + (1-t)\log(1-p))$$
where:
‘t’ is the true label of the binary classification problem, which can only take the values 0 or 1.
‘p’ is the predicted probability of the positive class (the class with label 1) output by the model, which ranges from 0 to 1.
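Plugging numbers into this definition shows the asymmetry of the loss: confident wrong predictions are punished far more heavily than confident right ones. A quick pure-Python check of $BCE(t,p)$:

```python
import math

def bce(t, p):
    # binary cross-entropy for a single example
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

print(round(bce(1, 0.9), 4))  # confident and correct -> 0.1054
print(round(bce(1, 0.1), 4))  # confident and wrong   -> 2.3026
print(round(bce(0, 0.1), 4))  # mirror case, same small loss -> 0.1054
```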
In this phase we conducted three experiments for the neural network by changing the parameters. All experiments used the Adam optimizer. Experiment 1 used 400 epochs and a learning rate of 0.01; Experiments 2 and 3 used 100 epochs with learning rates of 0.001 and 0.01, respectively. The first two experiments used 4 hidden layers and the last used 3.
We used the 160 features and implemented an MLP model using PyTorch Lightning, conducting 3 experiments with different parameters. The results table reports metrics such as accuracy, AUC, and log loss alongside parameters such as the optimizer, number of epochs, and hidden layers. Experiment 1 had the highest AUC-ROC score of 0.580125, followed by Experiment 3 with 0.530666; Experiment 2 had the lowest AUC-ROC score of 0.5. All three experiments had the same test accuracy of 0.918588, while the train accuracies were 0.9388, 0.9059, and 0.9412 for Experiments 1, 2, and 3, respectively. Based on these results, Experiment 1 performed best at predicting the binary target. The other two models matched its test accuracy, suggesting they also learned some of the underlying patterns. Experiment 3 achieved the highest train accuracy, which may indicate overfitting to the training data. Finally, the different loss functions and hyperparameters led to the differences in performance.
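A useful sanity check on the 0.918588 test accuracy: HCDR is heavily imbalanced (roughly 8% of applicants default), so a model that always predicts 0 already scores around 0.92 accuracy. With hypothetical counts matching that ratio:

```python
# hypothetical split with ~8.1% defaults (illustrative counts, not the real ones)
n_repaid, n_default = 91859, 8141

# a trivial model predicting "repaid" for everyone gets every repaid case right
majority_accuracy = n_repaid / (n_repaid + n_default)
print(round(majority_accuracy, 6))  # 0.91859
```

This is why the comparison above leans on AUC-ROC rather than accuracy: an identical 0.9186 accuracy is compatible with anything from a collapsed majority-class predictor to a genuinely discriminative model.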
Past: The Home Credit Default Risk project aims to provide loans and financial services to unbanked and low-credit-score customers. Previously, we denormalized and merged the primary and secondary tables and performed exploratory and visual data analysis. We also trained machine learning models such as Logistic Regression and Random Forest and recorded accuracy and AUC scores. Later, we focused on feature engineering and hyperparameter tuning: we used functions to drop columns with missing values and collinear features, and created three new features based on domain knowledge. We then trained a Logistic Regression model using grid search and cross-validation to find the best hyperparameters. Finally, we tested the model on the Kaggle test set and made a submission.
Present: Currently, we are focused on implementing neural networks for classification using PyTorch Lightning. For the neural-network classifier, we transformed the dataset into tensors using the ‘Data_pre_nn’ function. We used the ReLU (Rectified Linear Unit) function as the activation function and the Adam optimizer with a learning rate of 0.01. For the loss function, we used binary cross-entropy, minimized via backpropagation.
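A minimal sketch of this setup follows. The layer sizes and helper names are illustrative assumptions (the actual `Data_pre_nn` pipeline and Lightning module are not reproduced here); plain PyTorch is shown so the forward/backward step described above is explicit, whereas the project wraps the same pieces in a `LightningModule`.

```python
import torch
from torch import nn

class MLP(nn.Module):
    """Two-layer MLP producing a single logit for binary classification."""
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),                # ReLU activation, as in the report
            nn.Linear(hidden, 1),     # one logit for the binary target
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def training_step(model, optimizer, loss_fn, x, y):
    # Forward pass, binary cross-entropy loss, then backward propagation
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()

model = MLP(n_features=160)  # 160 engineered features, per the report
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # Adam, lr=0.01
loss_fn = nn.BCEWithLogitsLoss()  # BCE computed on raw logits
```

`BCEWithLogitsLoss` is used rather than `BCELoss` plus a sigmoid because it is the numerically stable way to apply binary cross-entropy to raw logits.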
Plan: Finally, the results obtained for the MLP were not as expected, which may be due to various reasons; logistic regression still gave the best AUC score.
Problems - Some of the problems we faced were the large sizes of the datasets, which led to long running times and kernel crashes. We also had to deal with computational complexity and with selecting appropriate model parameters, for which we conducted experiments varying the parameters.
The Home Credit Default Risk project aims to accurately predict whether a borrower will repay a loan. Our hypothesis is that ML pipelines using custom features, or only the most relevant features instead of all of them, can give good predictions of a client's repayment ability. In Phase 1 we performed EDA, which gave us many insights into the data and its relationships. In the next phase we focused on feature engineering: we dropped columns with many missing values and collinear features and created three new features based on domain knowledge, then conducted hyperparameter tuning using grid search and cross-validation. After this we implemented an MLP model in PyTorch Lightning and conducted three experiments with different parameters. Experiment 1 gave the best results, with the highest AUC score of 0.58 and a test accuracy of 0.918, using the following parameters: 400 epochs, a learning rate of 0.01, the Adam optimizer, and BCE as the loss function. Overall, the MLP did not meet expectations: its AUC score is not ideal and is lower than the AUC score obtained with the logistic regression model in the previous phase.
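One interpretation worth noting (an inference from the reported numbers, not stated in the results table): an AUC of exactly 0.5 combined with a high accuracy is the signature of a model that predicts a constant score, scoring the majority class correctly while ranking no positive above any negative. A quick check illustrates this:

```python
from sklearn.metrics import roc_auc_score

# Mixed labels; the model outputs the same score for every example.
y_true = [0, 0, 1, 1, 0, 1]
constant_scores = [0.5] * len(y_true)

# A constant predictor ranks no positive above any negative,
# so its ROC AUC is exactly 0.5 -- chance level.
auc = roc_auc_score(y_true, constant_scores)
print(auc)  # 0.5
```

This is why AUC, not accuracy, is the competition metric: on an imbalanced target, accuracy alone cannot distinguish a useful model from a degenerate one.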
Read the following: